Part 1: Loading and reviewing data

Beginning the analysis, I’ve loaded the project data and the relevant R packages. I also developed a custom function, custom_kable, for constructing clean HTML tables, and set my ggplot2 theme to theme_bw(), which is my preference. With that done, I have all the building blocks to begin the project.

library(dplyr)
library(ggplot2)
library(ggfortify)
library(GGally)
library(reshape2)
library(corrplot)
library(zoo)
library(rgl)
library(knitr)
library(kableExtra)

treasury_data <- read.csv("regression_final_project.csv")

#setting ggplot preference
theme_set(
  theme_bw()
)

#creating custom table function
custom_kable <- function(x){
  kable(x, format = "html") %>%
  kable_styling(bootstrap_options = "striped")
}

To start, I did a quick review of the data. As the table below shows, there are 11 variables, most of which are numeric. These are interest rates for various US Treasury yields to maturity, ranging from 3 months to 30 years. The remaining variables are the date (i.e. the day each Treasury rate was observed) plus categorical markers indicating whether the period was one where the Fed was easing or tightening monetary policy. Finally, there is an anonymized output variable, which serves as the project’s outcome focus. Right now, it’s unclear what this variable represents, but the project will work towards uncovering that.

data.frame(
  variable = names(treasury_data),
  data.type = sapply(treasury_data, typeof),
  values = sapply(treasury_data, function(x) paste0(head(x),  collapse = ", ")),
  row.names = NULL) %>%
  custom_kable()
variable data.type values
Date integer 1/5/1981, 1/6/1981, 1/7/1981, 1/8/1981, 1/9/1981, 1/12/1981
USGG3M double 13.52, 13.58, 14.5, 14.76, 15.2, 15.22
USGG6M double 13.09, 13.16, 13.9, 14, 14.3, 14.23
USGG2YR double 12.289, 12.429, 12.929, 13.099, 13.539, 13.179
USGG3YR double 12.28, 12.31, 12.78, 12.95, 13.28, 12.94
USGG5YR double 12.294, 12.214, 12.614, 12.684, 12.884, 12.714
USGG10YR double 12.152, 12.112, 12.382, 12.352, 12.572, 12.452
USGG30YR double 11.672, 11.672, 11.892, 11.912, 12.132, 12.082
Output1 double 18.01552579, 18.09139801, 19.44731416, 19.74851024, 20.57204231, 20.1421849
Easing integer NA, NA, NA, NA, NA, NA
Tightening integer NA, NA, NA, NA, NA, NA

As another quick glance at the data, the first 6 rows of each variable can be seen below. This further reinforces that both Easing and Tightening have numerous NAs.

head(treasury_data) %>%
  custom_kable()
Date USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR Output1 Easing Tightening
1/5/1981 13.52 13.09 12.289 12.28 12.294 12.152 11.672 18.01553 NA NA
1/6/1981 13.58 13.16 12.429 12.31 12.214 12.112 11.672 18.09140 NA NA
1/7/1981 14.50 13.90 12.929 12.78 12.614 12.382 11.892 19.44731 NA NA
1/8/1981 14.76 14.00 13.099 12.95 12.684 12.352 11.912 19.74851 NA NA
1/9/1981 15.20 14.30 13.539 13.28 12.884 12.572 12.132 20.57204 NA NA
1/12/1981 15.22 14.23 13.179 12.94 12.714 12.452 12.082 20.14218 NA NA

Following the first data review, I’m going to make one transformation: changing Date from a factor to a formal date.

treasury_data <- treasury_data %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))

Moving into some analysis, I’ve plotted all the Treasury yields from their first to last date. The chart highlights that these yields have steadily declined across all categories since 1981. It’s interesting to review the plot and match the lines to historical financial events, such as the 2008 market crash stemming from the sub-prime mortgage crisis. The general cyclical nature of the economy can also be seen in the graphic. As a note, there is a gap in the data from late 2007 to late 2008, marked by the shaded region, that stems from missing yield values.

bond_trend <- treasury_data %>%
  select(-Easing, -Tightening) %>%
  melt(value.name = "interest.rate", 
       variable.name = "treasury.yield",
       id.vars = "Date")

bond_trend %>%
  filter(treasury.yield != "Output1") %>%
  ggplot(aes(Date, interest.rate, colour = treasury.yield)) +
  geom_line(size = 1.3, alpha = .3) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  geom_vline(xintercept = as.Date(c("2007-11-29", "2008-11-11")),
             colour = "royalblue2", size = 1.3, alpha = .5) +
  annotate("rect", xmin = as.Date("2007-11-29"), xmax = as.Date("2008-11-11"),
           ymin = 0, ymax = max(treasury_data$USGG3M), 
           alpha = .1, fill = "dodgerblue2") +
  scale_y_continuous(breaks = seq(0, 20, 2)) +
  labs(title = "Interest rates have declined steadily from 1981 to 2014 in all bond categories",
       subtitle = "Vertical blue lines from 2007 to 2008 denote a gap in yield rates as a result of missing data",
       caption = "Source: Course Project Treasury Data")

As an added feature, I’ve also included a summary table for each decade. I developed the decade feature using Date and then grouped the melted data frame from the plot above by it. This shows the aggregate trend for each decade. I’ve included a range of statistics, with the table sorted by yield mean. As gleaned from the chart, these yield rates have steadily declined since the 1980s. Interestingly, the 1980s also had the largest range, from about 5 to 17; this spread is reflected in the decade having the highest standard deviation. The aforementioned 2008 financial crisis is evident again where the yields dropped slightly below zero on some dates.

bond_trend %>%
  mutate(decade.marker = substr(as.character(Date), 1, 3),
         decade = case_when(
           decade.marker == "198" ~ "1980s",
           decade.marker == "199" ~ "1990s",
           decade.marker == "200" ~ "2000s",
           decade.marker == "201" ~ "2010s")) %>%
  filter(treasury.yield != "Output1") %>%
  group_by(decade) %>%
  summarise(yield.mean = mean(interest.rate),
            yield.median = median(interest.rate),
            yield.sd = sd(interest.rate),
            yield.min = min(interest.rate),
            yield.max = max(interest.rate)) %>%
  arrange(desc(yield.mean)) %>%
  custom_kable()
decade yield.mean yield.median yield.sd yield.min yield.max
1980s 9.837335 9.19800 2.578732 5.1210 17.0100
1990s 5.972553 5.87700 1.326211 2.6480 9.1720
2000s 3.747912 4.17980 1.639327 -0.0410 6.9080
2010s 1.259734 0.68025 1.324698 -0.0152 4.8395

The plots can also be viewed using facets, which makes each yield period clearer.

bond_trend %>%
  filter(treasury.yield != "Output1") %>%
  ggplot(aes(Date, interest.rate, colour = treasury.yield)) +
  geom_line(size = 1.3) +
  facet_wrap(facets = "treasury.yield", nrow = 2) +
  theme(legend.position = "none") +
  labs(title = "Interest rates have declined steadily from 1981 to 2014 in all bond categories",
       caption = "Source: Course Project Treasury Data")

So far, the plots have only included the Treasury interest rates for the various bond types; the yet-unknown outcome variable hasn’t been included alongside them. To see how the predictor variables interact with the output, the plot below includes all trajectories. As seen, the outcome variable has been more erratic than the predictors. Despite this, at first glance it generally seems highly correlated with the input variables, some scale differences aside. Of note, the variable drops well below zero on many dates, including most of the 2000s and all of the 2010s.

bond_trend %>%
  ggplot(aes(Date, interest.rate, colour = treasury.yield)) +
  geom_line(size = 1.3, alpha = .4) +
  scale_y_continuous(breaks = seq(-20, 30, 5)) +
  geom_hline(yintercept = 0, colour = "royalblue2", size = 1.3, alpha = .3) +
  guides(colour = guide_legend(override.aes = list(size = 2))) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  labs(title = "Interest rates have declined steadily from 1981 to 2014 in all bond categories",
       subtitle = "Output1 has been more volatile than predictors in given timeframe -- variable identity still unknown",
       caption = "Source: Course Project Treasury Data",
       y = "interest rate (and unknown output values)")

The table below is similar to the decade summary for the predictors, but the output’s extremes are evident. The values are much higher than the yields in the 1980s but drop well below them starting in the early 1990s. While the variable remains shrouded in mystery, the plot and table highlight some attributes of how it behaves. Again, despite some scaling variation, the output and inputs seem strongly correlated.

bond_trend %>%
  mutate(decade.marker = substr(as.character(Date), 1, 3),
         decade = case_when(
           decade.marker == "198" ~ "1980s",
           decade.marker == "199" ~ "1990s",
           decade.marker == "200" ~ "2000s",
           decade.marker == "201" ~ "2010s")) %>%
  filter(treasury.yield == "Output1") %>%
  group_by(decade) %>%
  summarise(output.mean = mean(interest.rate),
            output.median = median(interest.rate),
            output.sd = sd(interest.rate),
            output.min = min(interest.rate),
            output.max = max(interest.rate)) %>%
  arrange(desc(output.mean)) %>%
  custom_kable()
decade output.mean output.median output.sd output.min output.max
1980s 10.9393048 9.3803598 6.4622746 1.227457 27.297880
1990s 0.7140675 0.4043963 2.7735960 -4.166962 8.384047
2000s -5.1761605 -5.3571968 3.5944082 -12.208100 2.422812
2010s -11.8650363 -11.8838808 0.8610511 -13.173111 -9.634017

To check the formal relationship between the inputs and output before getting into the linear modelling, I constructed a correlation visualization. As seen, the correlations are extremely high. In fact, they’re perfect in some cases, as with USGG2YR to USGG5YR, and almost perfect in the remaining few. This is a strong indication that the output variable might reasonably be some derivation of the inputs. That said, what that might be is still very unclear and there’s more work to uncover what the output actually is.

corplot_df <- treasury_data %>%
  select(USGG3M, USGG6M, USGG2YR, USGG3YR, USGG5YR, USGG10YR, USGG30YR, Output1) %>%
  rename(three.m = USGG3M,
         six.m = USGG6M,
         two.yr = USGG2YR,
         three.yr = USGG3YR, 
         five.yr = USGG5YR,
         ten.yr = USGG10YR,
         thirty.yr = USGG30YR,
         output = Output1)

corplot_df <- cor(corplot_df)

corrplot.mixed(corplot_df,
               title = "Correlation plot for Treasury Data- Output shows almost perfect correlation to inputs", mar = c(0, 0, 2, 0))


Part 2: Linear modelling

Estimate simple regression models with each of the input variables and the output variable.

Following the initial exploratory work, I’ll begin the modelling phase. To start, I’ll develop seven individual linear models, one for each Treasury yield category versus the output.

linear3M <- lm(Output1 ~ USGG3M, data = treasury_data)

linear6M <- lm(Output1 ~ USGG6M, data = treasury_data)

linear2YR <- lm(Output1 ~ USGG2YR, data = treasury_data)

linear3YR <- lm(Output1 ~ USGG3YR, data = treasury_data)

linear5YR <- lm(Output1 ~ USGG5YR, data = treasury_data)

linear10YR <- lm(Output1 ~ USGG10YR, data = treasury_data)

linear30YR <- lm(Output1 ~ USGG30YR, data = treasury_data)

Analyze the regression summaries

Starting with the first model, which focuses on the 3-month yield, it’s clear that the simple linear regression is highly significant. There were clear signs this might be the case during the correlation work, but the model summary helps solidify that intuition. Here, the model has a very significant slope, indicating the alternative hypothesis, that there is a relationship between the input and output variables, can be accepted.

The slope estimate is 2.50756, which means that for every one-unit increase in the 3-month yield, the output is expected to increase by that amount, about 2.5 points. The -11.72318 intercept is less interesting in any pragmatic sense given the output is unknown; nonetheless, it indicates the expected output level when the input rate is zero. Both the slope and intercept are significant as well. Broadly speaking, this means there is a verifiable relationship between the 3-month bond interest rate and the output. Assuming the output is of some significance to a business or government audience, this relationship would be very useful for understanding market trends or possibly forecasting. That said, with the output still unknown, it’s hard to draw many conclusions about the wider applicability of the model. Still, there are signs it might have fruitful applications.

summary(linear3M)
## 
## Call:
## lm(formula = Output1 ~ USGG3M, data = treasury_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9374 -1.2115 -0.0528  1.2640  7.7048 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.72318    0.03137  -373.7   <2e-16 ***
## USGG3M        2.50756    0.00541   463.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.69 on 8298 degrees of freedom
## Multiple R-squared:  0.9628, Adjusted R-squared:  0.9628 
## F-statistic: 2.148e+05 on 1 and 8298 DF,  p-value: < 2.2e-16

To better review this relationship, I’ve included a scatterplot of the two variables. The significant association is on display with a very strong positive relationship, further accentuated by a regression line. Of note, I’ve plotted it with its standard error band, although the error is so small it isn’t visible.

Perhaps more interestingly, the points show an uneven, clustered pattern. This is a clue that the relationship differs depending on the values of the variables. Additionally, it almost guarantees that the residuals from the regression are unequal, since points at the top follow a different pattern than those at the bottom. In fact, many sections seem very uneven. Again, this points to a non-traditional output, one that is in some way linked to the interest rate predictors (if not wholly derived from them).

treasury_data %>%
  ggplot(aes(USGG3M, Output1)) +
  geom_jitter(colour = "darkgray", alpha = .5) +
  geom_smooth(method = "lm", size = 1.3) +
  labs(title = "Output1 vs USGG3M interest rate- variables have a very significant relationship",
       subtitle = "Scatterplot pattern displays clustered points which are unevenly distributed",
       caption = "Source: Course Project Treasury Data")

Check relevance of the estimated parameters and the model as a whole, including the amount of correlation explained.

The plot raises another point. Given that the models are likely quite homogeneous, owing to similarly high correlation coefficients and plot trends, I decided to review them in unison. This helps highlight model comparisons while centralizing many model coefficients. I also wanted to practice writing a few custom functions to capture the various model outputs. To this end, I started by collecting the model variance.

This is a crucial step in the analysis. Collecting and reviewing how much variance a model explains is central to inferential statistics. To do this, there are a few variance related calculations that need to be made. First, I started with total variance. This is captured by taking the variance of Output1. In general terms, this measures how spread out each value from the output variable is from its average value. Using regression, the idea is to explain how much variance can be accounted for by the predictor.

This leads to the second calculation, which is unexplained variance. With a simple regression model, this is the amount of variance that the predictor variable does not explain. To derive this value, I squared the model’s sigma, the residual standard error, which yields the residual (unexplained) variance. This can then be compared against the total variance to assess model fit: if the model explains the output’s variance well, the unexplained variance should be low. Overall, this is important because it quantifies how well the model explains the output variable, which is ultimately one of the goals of regression.
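
As a minimal sketch of this decomposition (on hypothetical toy data, not the Treasury set), the unexplained variance is the squared sigma from the model summary, and the share explained follows directly:

```r
# Toy illustration of the variance decomposition described above.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)

fit <- lm(y ~ x)

total_var       <- var(y)                # total variance of the outcome
unexplained_var <- summary(fit)$sigma^2  # residual variance (sigma squared)

# Manual share of variance explained
manual_r2 <- (total_var - unexplained_var) / total_var

# Because sigma^2 carries a degrees-of-freedom correction, this matches the
# adjusted R-squared exactly and the multiple R-squared very closely
c(manual_r2, summary(fit)$adj.r.squared, summary(fit)$r.squared)
```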

To automate this process across all seven models, I used sapply to run the models and collect each unexplained variance at once. I captured the results in a data frame so they could be presented nicely, while also adding a new variable for the total percent of variance explained by each model. This is a manual R2, which can be checked against the normal model summary to make sure the calculations are correct.

The output of the custom function can be seen below. It shows that all of the models explain a very high share of the output’s variance. The model I started with, the 3-month bond, is on the lower side of variance explained, although that’s relative here given an R2 of about .96. The lowest is the 30-year bond with an R2 of about .935, while the 3-year bond is highest at a remarkable .998. This table really highlights how well each predictor explains the output variable. It lends further credence to the output being some amalgamation of the inputs, given some of the models explain nearly all the variance in Output1.

variance_explained <- function(columns){
  var_df <- data.frame(
    Model = colnames(treasury_data[,columns]),
    Total.Variance = var(treasury_data$Output1),
    Unexplained.Variance = sapply(columns, function(x) summary(lm(Output1 ~ treasury_data[,x], data = treasury_data))$sigma ^ 2))
  
  var_df %>%
  mutate(Percent.Explained.R2 = (Total.Variance - Unexplained.Variance) / Total.Variance * 100) %>% custom_kable()
}

variance_explained(2:8)
Model Total.Variance Unexplained.Variance Percent.Explained.R2
USGG3M 76.80444 2.8570577 96.28009
USGG6M 76.80444 1.9673208 97.43853
USGG2YR 76.80444 0.2588092 99.66303
USGG3YR 76.80444 0.1596570 99.79213
USGG5YR 76.80444 0.6382572 99.16898
USGG10YR 76.80444 2.3667523 96.91847
USGG30YR 76.80444 4.9672862 93.53255

Much like the previous custom function deriving variance explained for each model, I put together one to review model significance (this time with apply). Through this, I’m checking whether each overall model slope is significantly different from zero (the null hypothesis being that it is not). Not surprisingly given the previous work, all the models are highly significant; their respective p-values are so small they register as zero. Alongside the model p-values, I’ve included the F statistic and the degrees of freedom for the model and residuals. These are the components of the F test, which I’ve used to derive each p-value in the function via pf.

model_significance <-  function(columns){
  sig_df <- data.frame(
    model = colnames(treasury_data[,columns]),
  t(apply(treasury_data[,columns], 2, function(x) summary(lm(Output1 ~ x, treasury_data))$fstatistic)))
  
  sig_df %>%
    rename(f.statistic = value) %>%
    mutate(model.p.value = pf(f.statistic, numdf, dendf, lower.tail = F)) %>%
    custom_kable()
}

model_significance(2:8)
model f.statistic numdf dendf model.p.value
USGG3M 214798.7 1 8298 0
USGG6M 315695.9 1 8298 0
USGG2YR 2454520.2 1 8298 0
USGG3YR 3984011.3 1 8298 0
USGG5YR 990359.0 1 8298 0
USGG10YR 261016.2 1 8298 0
USGG30YR 120021.6 1 8298 0

Collect all slopes and intercepts in one table and print this table. Try to do it in one line using apply function.

For my previous functions I’ve used both sapply and apply, and that continues here for collecting all the model slopes and intercepts. Per the question, I’ve utilized apply as part of a larger function to collect and display the model coefficients.

Having already covered the slope and intercept for the 3-month bond, all the interpretations here are similar, save for the values themselves. The models have similar slopes, ranging from about 2.4 (USGG2YR) to about 3.1 (USGG30YR). This means that for every one-unit increase in a given interest rate, the expected output would increase by that value. The intercepts vary more widely, from -11.72318 (USGG3M) to -21.08590 (USGG30YR). In combination with the slopes, this means each line has a slightly different angle, despite the models being so similar.

Again, all the slopes and intercepts are very significant, as seen by the zeroes in both p-value columns. These p-values are derived using the t distribution, hence the inclusion of the respective t-values (all of which are, naturally, very high).
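
As a quick sketch (toy data, hypothetical names) of how those coefficient p-values come out of the t distribution:

```r
# A coefficient p-value is its t-value referred to the t distribution
# with the model's residual degrees of freedom.
set.seed(1)
x <- rnorm(50)
y <- 1.5 * x + rnorm(50)

fit   <- summary(lm(y ~ x))
coefs <- fit$coefficients

# t-value = estimate / standard error
t_value <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]

# Two-sided p-value; fit$df[2] is the residual degrees of freedom
p_value <- 2 * pt(abs(t_value), df = fit$df[2], lower.tail = FALSE)

# Should match the Pr(>|t|) column from summary()
c(p_value, coefs["x", "Pr(>|t|)"])
```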

input_coefficients <- function(columns){
  input_df <- data.frame(
  Model = colnames(treasury_data[,columns]),
  t(apply(treasury_data[,columns], 2, function(x) summary(lm(Output1 ~ x, treasury_data))$coefficients[c(1, 2, 5, 6, 7, 8)])))
  
  rownames(input_df) <- NULL
  
  input_df %>%
    rename(Intercept = X1,
           Slope = X2,
           Intercept.t.value = X3,
           Slope.t.value = X4,
           Intercept.p.value = X5,
           Slope.p.value = X6) %>% custom_kable()
}

(input_coef <- input_coefficients(2:8))
Model Intercept Slope Intercept.t.value Slope.t.value Intercept.p.value Slope.p.value
USGG3M -11.72318 2.507561 -373.7126 463.4638 0 0
USGG6M -12.09753 2.497235 -457.0457 561.8683 0 0
USGG2YR -13.05577 2.400449 -1301.5071 1566.6908 0 0
USGG3YR -13.86162 2.455793 -1687.6241 1995.9988 0 0
USGG5YR -15.43665 2.568742 -866.3143 995.1678 0 0
USGG10YR -18.06337 2.786991 -461.0150 510.8975 0 0
USGG30YR -21.08590 3.069561 -321.4475 346.4413 0 0

The last measure to review from each model is adjusted R2. While R2 was discussed, it’s worth spending time on the adjusted version as well. Adjusted R2 still measures how much variance a model explains, but with a slight formula difference. Adding new predictors inherently explains more variance in a model, even if they add no worthwhile information. To offset inflating R2 with ultimately less useful predictors, adjusted R2 applies a penalty and can be lowered as a result. Ideally, the model will be near-equal on both metrics; if adjusted is notably lower, it’s a sign that redundant or less-than-ideal predictors are included. In all the models here, the adjusted R2 matches the original variance-explained numbers. Even though these are one-predictor models, seeing them aligned signals there isn’t any excess or redundant information.
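
The penalty can be written out directly. A small sketch on toy data (names hypothetical) verifying the formula against summary():

```r
# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1),
# where n is the sample size and p the number of predictors.
set.seed(1)
x <- rnorm(80)
y <- x + rnorm(80)

fit <- summary(lm(y ~ x))

n <- 80
p <- 1
manual_adj <- 1 - (1 - fit$r.squared) * (n - 1) / (n - p - 1)

# Should match summary()'s adjusted R-squared
c(manual_adj, fit$adj.r.squared)
```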

adj_rsquared <- data.frame(
  lapply(treasury_data[,2:8], 
         function(x) summary(lm(Output1 ~ x, data = treasury_data))[9])
)

names(adj_rsquared) <- names(treasury_data[,2:8])

rownames(adj_rsquared) <- "adj.r.squared"

custom_kable(t(adj_rsquared))
adj.r.squared
USGG3M 0.9628009
USGG6M 0.9743853
USGG2YR 0.9966303
USGG3YR 0.9979213
USGG5YR 0.9916898
USGG10YR 0.9691847
USGG30YR 0.9353255

Plot the output variable together with the fitted values

Before plotting the fitted values, it’s worth doing a final review of each linear regression model, this time from a visual standpoint. I previously plotted USGG3M, so I thought it worthwhile to revisit all the bivariate relationships. As seen, they all have the same uneven distributions. This lends visual confirmation to the .998 R2 for USGG3YR as well: the points are almost wholly covered by the regression line.

bond_trend %>%
  filter(treasury.yield != "Output1") %>%
  mutate(Output1 = rep(treasury_data$Output1, 7)) %>%
  ggplot(aes(interest.rate, Output1)) +
  geom_jitter(aes(colour = treasury.yield), alpha = .2) +
  geom_smooth(method = "lm", size = 1.3) +
  facet_wrap(facets = "treasury.yield", nrow = 2) +
  theme(legend.position = "none") +
  labs(title = "Output1 vs all bond category interest rates- all variables have a very significant relationship",
       subtitle = "Scatterplot patterns all display clustered points which are unevenly distributed",
       caption = "Source: Course Project Treasury Data")

These scatterplots also portend the fitted-versus-actual comparison quite well. As before, I’ve collected all the fitted values, this time using lapply. These further confirm the very high model fits. Again, USGG3YR displays a nearly perfect fit, though the plots for USGG2YR and USGG5YR show very close fits as well.

This is in line with my expectations following the linear model summary reviews: the highest-R2 models have the best fitted values. One slight surprise is that USGG30YR has slightly more erratic fitted values, specifically in the 2000s, when compared to the best-performing models. Similarly, USGG3M, USGG6M, and USGG10YR have periods where they diverge from the actual values. Even in these cases, though, the fitted values remain in close alignment with the original inputs.

fitted_values <- data.frame(
  lapply(treasury_data[,2:8], 
         function(x) lm(x ~ Output1, data = treasury_data)$fitted.values)
)

fitted_values <- melt(fitted_values,
                      value.name = "fitted.values", 
                      variable.name = "treasury.yield")

bond_trend <- bond_trend %>%
  filter(treasury.yield != "Output1")

bond_trend <- cbind(bond_trend, fitted_values$fitted.values)

bond_trend %>%
  rename(fitted.values = `fitted_values$fitted.values`) %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = interest.rate), colour = "royalblue3", alpha = .75) +
  geom_line(aes(y = fitted.values), size = 1.3, 
            colour = "orange2", alpha = .25) +
  facet_wrap(facets = "treasury.yield", nrow = 2) +
  labs(title = "Input values plotted against fitted regression values for all bond categories",
       subtitle = "All models have very close alignment between actual (blue line) and fitted (orange line)- 3YR is closest (Adj. R2 = 0.998)",
       caption = "Source: Course Project Treasury Data")

Another crucial part of regression is checking the residuals against the fitted values. While the previous plot shows the very close fit, it’s also instructive to check how the residuals are distributed against the fitted values using a scatterplot. For convenience, I’ve made a data frame to check all the bond categories simultaneously.

In an ideal situation, the scatterplot should be randomly distributed, with values as close to zero as possible. However, these plots do not show that. In fact, they all exhibit systematic error, which is not ideal. This again points to some underlying systematic link between the variables; I think it adds further strength to the idea that the output is an amalgam of the inputs in some way.
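
To make the idea concrete, here’s a toy sketch (not the Treasury data) where a straight-line fit to a mildly nonlinear relationship leaves an obvious systematic curve in the residuals:

```r
# When the true relationship is nonlinear, a linear fit produces residuals
# with a systematic pattern against the fitted values instead of a random
# scatter around zero.
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- x + 0.5 * x^2 + rnorm(200)

fit <- lm(y ~ x)

plot(fit$fitted.values, fit$residuals,
     xlab = "fitted values", ylab = "residuals",
     main = "Systematic residual pattern from a misspecified linear fit")
abline(h = 0, col = "dodgerblue2", lwd = 2)
```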

bond_categories <- colnames(treasury_data[,2:8])

residual_check <- data.frame(
  model = factor(rep(bond_categories, each = 8300), levels = bond_categories),
  resid = c(linear3M$residuals, linear6M$residuals, linear2YR$residuals,
            linear3YR$residuals, linear5YR$residuals, linear10YR$residuals,
            linear30YR$residuals),
  fitted = c(linear3M$fitted.values, linear6M$fitted.values, linear2YR$fitted.values,
             linear3YR$fitted.values, linear5YR$fitted.values, linear10YR$fitted.values,
             linear30YR$fitted.values))

residual_check %>%
  ggplot(aes(fitted, resid, colour = model)) +
  geom_jitter(alpha = .2, show.legend = F) +
  geom_hline(yintercept = 0, colour = "dodgerblue2", size = 1.3) +
  facet_wrap(facets = "model", nrow = 2, scales = "free_y") +
  labs(title = "Residuals vs fitted values for simple linear models from all bond categories",
       subtitle = "Points should be randomly distributed around zero; however, all plots seem to have systematic patterns signalling model issues",
       caption = "Source: Course Project Treasury Data",
       y = "model residuals",
       x = "fitted values")


Part 3: Linear modeling using inputs as outcome variable

Moving into the next section, I’m developing linear models again, but with the output as the predictor this time.

linear3M_output <- lm(treasury_data[,2] ~ Output1, data = treasury_data)

linear6M_output <- lm(treasury_data[,3] ~ Output1, data = treasury_data)

linear2YR_output <- lm(treasury_data[,4] ~ Output1, data = treasury_data)

linear3YR_output <- lm(treasury_data[,5] ~ Output1, data = treasury_data)

linear5YR_output <- lm(treasury_data[,6] ~ Output1, data = treasury_data)

linear10YR_output <- lm(treasury_data[,7] ~ Output1, data = treasury_data)

linear30YR_output <- lm(treasury_data[,8] ~ Output1, data = treasury_data)

summary(linear3YR_output)
## 
## Call:
## lm(formula = treasury_data[, 5] ~ Output1, data = treasury_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23099 -0.10862 -0.01352  0.10095  0.83112 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.6444580  0.0017841    3164   <2e-16 ***
## Output1     0.4063541  0.0002036    1996   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1625 on 8298 degrees of freedom
## Multiple R-squared:  0.9979, Adjusted R-squared:  0.9979 
## F-statistic: 3.984e+06 on 1 and 8298 DF,  p-value: < 2.2e-16

Collect all slopes and intercepts in one table and print this table

As per the instructions, I’ve collected all the slopes and intercepts with the same basic apply function used for the initial linear models. While the function is slightly redundant here, I wanted the development practice, so it’s included.

The intercepts follow the same basic pattern as in the initial linear models, rising slowly from USGG3M to USGG30YR, though the values span a narrower range. The slopes follow a similar construction. Perhaps more interesting is that the slope t-values are identical between the two sets of models. The formula for a slope t-value is the slope coefficient divided by its standard error, and despite both components differing between the paired models, the t-value ends up the same. How does this work?

It’s actually an instructive point to review. Despite the differences in slopes and standard errors, since the same two variables are being used, the test statistic ends up the same. Given that a t-value measures the size of an effect relative to the variation in the sample, swapping the input and output doesn’t change this underlying relationship. In the same vein, the model R2 values do not change, since the share of variance jointly explained by the pair does not change. While a slight digression, it’s important to delve into the underlying statistics and see why these values end up identical.
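
A toy demonstration of this symmetry (hypothetical data, not the project variables):

```r
# Swapping which variable is the outcome changes the slope and intercept,
# but the slope t-value and R-squared depend only on the correlation
# between the two variables, so they are identical in both directions.
set.seed(1)
x <- rnorm(60)
y <- 2 * x + rnorm(60)

forward  <- summary(lm(y ~ x))
reversed <- summary(lm(x ~ y))

# Identical slope t-values despite different slopes and standard errors
c(forward$coefficients["x", "t value"],
  reversed$coefficients["y", "t value"])

# Identical R-squared values
c(forward$r.squared, reversed$r.squared)
```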

output_coefficients <- function(columns){
  input_df <- data.frame(
  Model = colnames(treasury_data[,columns]),
  t(apply(treasury_data[,columns], 2, function(x) summary(lm(x ~ Output1, treasury_data))$coefficients[c(1, 2, 5, 6, 7, 8)])))
  
  rownames(input_df) <- NULL
  
  input_df %>%
    rename(Intercept = X1,
           Slope = X2,
           Intercept.t.value = X3,
           Slope.t.value = X4,
           Intercept.p.value = X5,
           Slope.p.value = X6) %>% custom_kable()
}

(output_coef <- output_coefficients(2:8))  
Model Intercept Slope Intercept.t.value Slope.t.value Intercept.p.value Slope.p.value
USGG3M 4.675134 0.3839609 643.9555 463.4638 0 0
USGG6M 4.844369 0.3901870 796.0347 561.8683 0 0
USGG2YR 5.438888 0.4151851 2341.9882 1566.6908 0 0
USGG3YR 5.644458 0.4063541 3163.8133 1995.9988 0 0
USGG5YR 6.009421 0.3860610 1767.6898 995.1678 0 0
USGG10YR 6.481316 0.3477544 1086.5690 510.8975 0 0
USGG30YR 6.869355 0.3047124 891.2273 346.4413 0 0

Part 4: Logistic regression to analyze Fed easing and tightening

On to the logistic component focusing on Federal Reserve easing and tightening. This section starts with some data preparation. The initial set only has a one in the periods where easing or tightening took place; without any other markers, there isn’t much to work with. To this end, this first chunk starts transforming and combining these columns by recoding any period of easing to zero. Since the Fed cannot be simultaneously easing and tightening, this is a binary marker, and the columns are now prepped to be combined so that a period of tightening equals one and easing equals zero.

federal_cycles <- treasury_data %>%
  mutate(Easing = ifelse(Easing == 1, 0, NA),
         Tightening = ifelse(Tightening == 1, 1, NA))

Staying with the flow of the assignment, I checked the transformed values against the original copy to make sure they match. They did.

custom_kable(federal_cycles[c(550:560, 900:910, 970:980),c(1, 10, 11)])
Date Easing Tightening
550 1983-03-29 NA NA
551 1983-03-30 NA NA
552 1983-03-31 NA NA
553 1983-04-04 NA 1
554 1983-04-05 NA 1
555 1983-04-06 NA 1
556 1983-04-07 NA 1
557 1983-04-08 NA 1
558 1983-04-11 NA 1
559 1983-04-12 NA 1
560 1983-04-13 NA 1
900 1984-08-27 NA 1
901 1984-08-28 NA 1
902 1984-08-29 NA 1
903 1984-08-30 NA 1
904 1984-08-31 NA 1
905 1984-09-04 NA NA
906 1984-09-05 NA NA
907 1984-09-06 NA NA
908 1984-09-07 NA NA
909 1984-09-10 NA NA
910 1984-09-11 NA NA
970 1984-12-10 0 NA
971 1984-12-11 0 NA
972 1984-12-12 0 NA
973 1984-12-13 0 NA
974 1984-12-14 0 NA
975 1984-12-17 0 NA
976 1984-12-18 0 NA
977 1984-12-19 0 NA
978 1984-12-20 0 NA
979 1984-12-21 0 NA
980 1984-12-24 0 NA

As alluded to, the end goal here was to combine the federal monetary policy markers into one column, now called Tightening. I’ve also created a stand-alone data frame for this component to avoid transforming the original data too much. In the same vein, since roughly 70% of the rows are NA for both markers, making a reduced set makes sense.

federal_cycles <- federal_cycles %>%
  filter(!is.na(Easing) | !is.na(Tightening)) %>%
  mutate(Tightening = ifelse(is.na(Tightening), 0, Tightening)) %>%
  select(-Easing)

I thought doing a quick profile of the new set was reasonable since so many values were dropped. The table shows that federal reserve easing or tightening took place between fall 1981 and fall 1992, a period spanning slightly over a decade. To put this in perspective, the entire data set spans from January 1981 to June 2014, so a considerable portion has been removed for this component of the analysis.

federal_cycles %>%
  summarise(min.date.range = min(Date),
            max.date.range = max(Date),
            date.range = difftime(max.date.range, min.date.range, units = "days"),
            years.total = round(as.numeric(date.range) / 365)) %>%
  custom_kable()
min.date.range max.date.range date.range years.total
1981-11-02 1992-09-04 3959 days 11

Additionally, the variable is fairly unbalanced. There are far more easing than tightening samples (1585 vs 773).

federal_cycles %>%
  mutate(Tightening = ifelse(Tightening == 0, "Easing", "Tightening")) %>%
  group_by(Tightening) %>%
  count() %>%
  summarise(period_count = n,
            variable_percent = n / nrow(federal_cycles) * 100) %>%
  custom_kable()
Tightening period_count variable_percent
Easing 1585 67.21798
Tightening 773 32.78202

Plot the data and the binary output variable representing easing (0) and tightening (1) periods

Before jumping into the plots, some context is necessary. So far, I’ve eschewed reviewing what easing and tightening entail from a monetary policy perspective. With this concept essential to the section, I’ve included a brief overview here. Central banks use monetary policy to regulate specific factors within the economy, and in the US the federal funds rate is the leading tool for doing so; rate setting is a core part of the monetary policy tool kit. Tightening occurs when the central bank raises the federal funds rate, and easing occurs when it lowers the rate (Investopedia, 2018). With that in mind, the plot should show rates going up and down depending on the Fed’s policy stance.

Per the definition, these periods closely correspond to rate changes depending on the cycle. Points when the orange line is at 20 correspond to tightening periods; when the line drops to zero, there’s an easing period. As is evident, all the bond rates and the output closely mirror these policy changes, albeit with different levels.

This plot only offers a narrow, inward look at these cycles. There aren’t really any wider macro insights to be gained about the US economy or markets more generally, since the predictors only concern the Fed. That said, the cycles are easy to pick out visually, so extending this with broader macroeconomic data would be an interesting follow-up analysis.

As an aesthetic note, I thought the original plot was too busy with colour, so I made all the yields gray while highlighting the output and the Federal Reserve monetary cycles. This also fits my intuition that the output is a combination of the input predictors, something the plot reasonably strengthens given how closely all the variables move in concert during easing and tightening.

fed_trend <- federal_cycles %>%
  select(-Date) %>%
  mutate(Tightening = Tightening * 20,
         index = 1:nrow(federal_cycles)) %>%
  melt(value.name = "interest.rate", 
       variable.name = "treasury.yield",
       id.vars = "index")

fed_trend %>%
  ggplot(aes(index, interest.rate, colour = treasury.yield)) +
  geom_line(aes(size = treasury.yield)) +
  scale_y_continuous(breaks = seq(-20, 30, 5)) +
  scale_colour_manual(values = c("lightgray", "lightgray", "lightgray", "lightgray",
                                 "lightgray", "lightgray", "lightgray", "royalblue2",
                                 "darkorange"),
                      name = "Treasury.Legend",
                      breaks = c("USGG3M", "Output1", "Tightening"),
                      labels = c("Bonds", "Output1", "Monetary.Tightening")) +
  scale_size_manual(values = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.75, 1.5),
                    guide = "none") +
  guides(colour = guide_legend(override.aes = list(size = 2))) +
  labs(title = "Bond yields, output, and federal tightening periods- Includes 2358 days between 1981 and 1992",
       subtitle = "Rates and output seem to have positive association with federal activities (i.e. rise in tightening, fall during easing)",
       y = "Data and Binary Federal Tightening",
       x = "Samples",
       caption = "Source: Course Project Treasury Data")

Estimate logistic regression with 3M yields as predictors for easing/tightening output and then plot the fitted values

This simple logistic model is useful for understanding whether the federal reserve is either easing or tightening based on the USGG3M interest rate levels. The model summary offers insights towards this end.

Overall, the slope and intercept are highly significant, as indicated by their respective p-values. This isn’t all that surprising since the two variables are inherently linked in practice: the Federal Reserve lowers rates during easing and raises them during tightening.

Moving onto the slope coefficient, the summary indicates that for a one-unit increase in USGG3M, the log odds of being in a tightening period increase by about 0.186; exponentiating, the odds of tightening multiply by roughly exp(0.186) ≈ 1.2, a 20% increase in the odds. This explanation is perfunctory, however, and requires some more pragmatic context. Harkening back to what the Federal Reserve is doing with tightening, these periods mark an increase in interest rates. This policy direction is therefore, by definition, noted for rising rates. As such, the model picks up this intuitive relationship between a tightening period and interest rate increases. The model verifies a real-world association and signals that as rates rise, the likelihood of being in a tightening period, denoted with log odds, increases.

On top of this, the samples included in the model only span 1981 to 1992, when rates tended to be much higher than in the later 1990s and 2000s. As such, the relationship is likely more accentuated than it would be if the entire sample were included.

logistic_3M <- glm(Tightening ~ USGG3M, 
                   data = federal_cycles,
                   family = binomial(link = logit))

summary(logistic_3M)
## 
## Call:
## glm(formula = Tightening ~ USGG3M, family = binomial(link = logit), 
##     data = federal_cycles)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4239  -0.9014  -0.7737   1.3548   1.6743  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.15256    0.17328 -12.422   <2e-16 ***
## USGG3M       0.18638    0.02144   8.694   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2983.5  on 2357  degrees of freedom
## Residual deviance: 2904.8  on 2356  degrees of freedom
## AIC: 2908.8
## 
## Number of Fisher Scoring iterations: 4
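The log-odds-to-odds conversion can be checked directly (the slope value is copied from the model summary above):

```r
# Converting the USGG3M slope from log odds to an odds ratio
slope <- 0.18638
odds_ratio <- exp(slope)
odds_ratio  # ~1.205: each one-point rise in USGG3M multiplies the odds of tightening by about 1.2
```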

As always, a visual representation of this association is instructive for better understanding its dynamic. While logistic models deal with binary outcome variables, bivariate plots with regression lines can still be constructed using ggplot2. The output is quite similar to a normal scatterplot with a fitted line, but the y-axis here is either one or zero, signifying the monetary period. The logistic curve (produced by passing a binomial family to the smoother) still highlights the relationship even though the outcome is binary. This further confirms the logistic model summary and shows the positive association between tightening periods and rising rates.

federal_cycles %>%
  ggplot(aes(USGG3M, Tightening)) +
  geom_point(alpha = .1, colour = "darkorange") +
  geom_smooth(method = "glm", size = 1.5, se = FALSE,
    method.args = list(family = "binomial")) +
    labs(title = "Federal Reserve easing and tightening periods vs USGG3M yield rates with logistic line",
       subtitle = "As yield rates rise, the likelihood of being in a tightening period (1) increases",
       y = "Binary Federal Tightening",
       caption = "Source: Course Project Treasury Data")

Continuing with visualizations that aid the model, I’ve plotted the logistic fitted values against the other bond rates, the output, and the monetary policy periods. The fitted probabilities have been scaled into interest-rate units by multiplying by 20 (not mathematically derived; done only for visual effect). Both the fitted values and the original interest rates are highlighted to make comparison easier. The model provides close estimates of the interest rates. Interestingly, the shapes of the lines are almost identical, but the fitted values sit several points below the actuals. As such, the model seems to produce output with high but consistent bias alongside low variance (i.e. fitted values are consistently low but otherwise track the original rates closely).

Despite the scaling being done for visual appeal, it makes the fitted values compare favourably with the originals. This is a good sign that the model picks up the association between interest rates and tightening. The clear uptick of fitted values during the known tightening periods, and the drop during easing, further confirms that the model captures the rate-policy association quite well.

fed_trend <- data.frame(
  index = 1:nrow(federal_cycles),
  treasury.yield = "Fitted.Values",
  interest.rate = logistic_3M$fitted.values * 20) %>%
  bind_rows(fed_trend)
  
fed_trend %>%
  ggplot(aes(index, interest.rate, colour = treasury.yield)) +
  geom_line(aes(size = treasury.yield)) +
  scale_y_continuous(breaks = seq(-20, 30, 5)) +
  scale_colour_manual(values = c("#B452CD", "lightgray", "darkorange", "lightgray",
                                 "lightgray", "lightgray", "royalblue2", "lightgray",
                                 "lightgray", "lightgray"),
                      name = "Treasury.Legend",
                      breaks = c("Fitted.Values", "USGG3M", "Output1", "Tightening"),
                      labels = c("Fitted.Values", "USGG3M", 
                                 "Bonds and Output1", "Monetary.Tightening")) +
  scale_size_manual(values = c(1, 0.75, 1.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1),
                    guide = "none") +
  guides(colour = guide_legend(override.aes = list(size = 2))) +
  labs(title = "Bond yields, output, and federal tightening periods- Includes 2358 days between 1981 and 1992",
       subtitle = "Rates and output seem to rise in tightening and fall during easing- Logistic regression fitted values follow this trend",
       y = "Data and Binary Federal Tightening",
       x = "Samples",
       caption = "Source: Course Project Treasury Data")

Use all inputs as predictors for logistic regression

Building on the previous work, I developed a logistic regression model with all the bond categories.

logistic_full <- glm(Tightening ~ USGG3M + USGG6M + USGG2YR +
                       USGG3YR + USGG5YR + USGG10YR + USGG30YR, 
                   data = federal_cycles,
                   family = binomial(link = logit))

Explore the model. Interpret the coefficients and the fitted values

In stark contrast to the simple logistic model, this summary is filled with contradictions and is difficult to interpret. All the slopes are highly significant, save for the 10-year bond class, which at a .32 p-value is not especially close. Further, the slope coefficients seem almost antithetical: three predictors suggest a decrease in log odds (including USGG3M, which was positive in the simple logistic model) while four indicate an increase. The magnitudes also don’t make sense. For example, the USGG3M coefficient of -3.3456 corresponds to an odds ratio of about 0.035 (a roughly 28-fold decrease in the odds when exponentiated), which is essentially cancelled out by the USGG30YR coefficient of 3.3254 (a roughly 28-fold increase). Viewed in totality, the exponentiated coefficients almost cancel each other out given the balance of increases and decreases. All considered, these results are erratic and model interpretation suffers as a result.

My guess for why this model seems erratic comes down to too many correlated predictors. All the bond categories are highly, if not almost perfectly, correlated, which can obscure individual predictor contributions and produce confusing coefficients. Collinearity, or multicollinearity in this case, among predictors inflates coefficient standard errors, which can push otherwise meaningful predictors toward non-significance. This is a major issue because the model is being used to determine the effect of each predictor during a specific policy cycle.

summary(logistic_full)
## 
## Call:
## glm(formula = Tightening ~ USGG3M + USGG6M + USGG2YR + USGG3YR + 
##     USGG5YR + USGG10YR + USGG30YR, family = binomial(link = logit), 
##     data = federal_cycles)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2113  -0.8595  -0.5935   1.1306   2.5530  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.7552     0.4312 -11.029  < 2e-16 ***
## USGG3M       -3.3456     0.2666 -12.548  < 2e-16 ***
## USGG6M        4.1559     0.3748  11.089  < 2e-16 ***
## USGG2YR       3.9460     0.7554   5.224 1.75e-07 ***
## USGG3YR      -3.4642     0.9340  -3.709 0.000208 ***
## USGG5YR      -3.2115     0.7795  -4.120 3.79e-05 ***
## USGG10YR     -0.9705     0.9764  -0.994 0.320214    
## USGG30YR      3.3254     0.6138   5.418 6.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2983.5  on 2357  degrees of freedom
## Residual deviance: 2629.6  on 2350  degrees of freedom
## AIC: 2645.6
## 
## Number of Fisher Scoring iterations: 4
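The collinearity conjecture can be probed with a variance inflation factor (VIF): for each predictor, regress it on the other predictors and compute 1 / (1 - R2), with values above 10 a common rule of thumb for problematic collinearity. A self-contained sketch on simulated data (the project’s bond yields could be checked the same way):

```r
# Two nearly collinear predictors, similar to adjacent bond maturities
set.seed(11)
x1 <- rnorm(1000)
x2 <- x1 + rnorm(1000, sd = 0.1)  # x2 is almost a copy of x1

# VIF by hand: regress one predictor on the other(s)
r2  <- summary(lm(x2 ~ x1))$r.squared
vif <- 1 / (1 - r2)
vif  # far above the rule-of-thumb threshold of 10
```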

Building on the previous points, developing a model with only two terms, the 3- and 6-month categories, further highlights the effect of collinearity. Even with just one extra term, the USGG3M coefficient flips sign completely relative to the single-predictor model and becomes negative. Again, this is likely driven by collinearity, as the extra term drives a very big change in the predictor coefficients.

logistic_months <- glm(Tightening ~ USGG3M + USGG6M,
                   data = federal_cycles,
                   family = binomial(link = logit))

summary(logistic_months)
## 
## Call:
## glm(formula = Tightening ~ USGG3M + USGG6M, family = binomial(link = logit), 
##     data = federal_cycles)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1242  -0.8310  -0.6841   1.2849   1.9546  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.4261     0.1802  -13.46   <2e-16 ***
## USGG3M       -1.7899     0.1921   -9.32   <2e-16 ***
## USGG6M        1.9353     0.1873   10.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2983.5  on 2357  degrees of freedom
## Residual deviance: 2783.2  on 2355  degrees of freedom
## AIC: 2789.2
## 
## Number of Fisher Scoring iterations: 4

The initial model was easy to understand but did have a higher AIC. This metric is used in model comparison, and a lower AIC is generally desirable. AIC is likelihood-based, so it applies to logistic models too, but it only ranks models fit to the same data and says nothing about interpretability. Despite the full model’s better AIC, it is more difficult to work with, given that one of the key goals here is understanding the relationship between interest rates and Federal Reserve monetary policy cycles.

AIC(logistic_3M, logistic_full) %>% 
  custom_kable()
df AIC
logistic_3M 2 2908.764
logistic_full 8 2645.579

The fitted values plot is very noisy and doesn’t help illuminate what policy period the Fed is in. For example, in the first easing window, the values range from about eighteen to zero and fluctuate substantially. This window isn’t atypical either, as periods of easing and tightening see erratic fitted value changes that are out of sync with the policy cycles. As mentioned above, the logistic model is being used primarily to drive a more thorough understanding of the easing and tightening periods using interest rates. Plotting the fitted values also makes it possible to visualize rates and fitted values over time. For these reasons, the simpler model with greater interpretability seems like the better choice in this case.

fed_trend <- data.frame(
  index = 1:nrow(federal_cycles),
  treasury.yield = "Fitted.Values",
  interest.rate = logistic_full$fitted.values * 20) %>%
  bind_rows(fed_trend)
  
fed_trend %>%
  ggplot(aes(index, interest.rate, colour = treasury.yield)) +
  geom_line(aes(size = treasury.yield)) +
  scale_y_continuous(breaks = seq(-20, 30, 5)) +
  scale_colour_manual(values = c("royalblue2", "lightgray", "darkorange", "lightgray",
                                 "lightgray", "lightgray", "lightgray", "lightgray",
                                 "lightgray", "lightgray"),
                      name = "Treasury.Legend",
                      breaks = c("Fitted.Values", "Output1", "Tightening"),
                      labels = c("Fitted.Values", "Bonds & Output1",
                                 "Monetary.Tightening")) +
  scale_size_manual(values = c(0.1, 0.1, 1.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1),
                    guide = "none") +
  guides(colour = guide_legend(override.aes = list(size = 2))) +
  labs(title = "Bond yields, output, and federal tightening periods- Includes 2358 days between 1981 and 1992",
       subtitle = "Rates and output seem to rise in tightening and fall during easing- Logistic regression fitted values follow this trend",
       y = "Data and Binary Federal Tightening",
       x = "Samples",
       caption = "Source: Course Project Treasury Data")

However, because I was curious about how each would do in a prediction task, I split the logistic regression data frame and fit models for USGG3M and for all predictors on the training data. Applying these to a test set makes it possible to gauge which model has better predictive utility. Neither is all that good, probably owing in part to the unbalanced sample, but the full model has a higher balanced accuracy (59% vs. 46%). While this is certainly a digression, it provides some contextual support for how an analysis goal shapes which model is preferable in a given situation. The prediction task favoured the full model, but its interpretation is still lacking for an inquiry based on understanding relationships.

library(caret)

federal_cycles <- federal_cycles %>%
  mutate(Tightening = as.factor(Tightening))

set.seed(1017)
data_split <- createDataPartition(y = federal_cycles$Tightening, p = .7, list = F)

fed_train <- federal_cycles %>%
  slice(data_split)

fed_test <- federal_cycles %>%
  slice(-data_split)

train_3M <- glm(Tightening ~ USGG3M, 
                data = fed_train,
                family = binomial(link = logit))
  
train_full <-  glm(Tightening ~ USGG3M + USGG6M + USGG2YR +
        USGG3YR + USGG5YR + USGG10YR + USGG30YR, 
        data = fed_train,
        family = binomial(link = logit))

model_predictions <- data.frame(
  three_probs = predict(object = train_3M, newdata = fed_test, type = "response"),
  full_probs = predict(object = train_full, newdata = fed_test, type = "response")
)

model_predictions <- model_predictions %>%
  mutate(three_preds = factor(ifelse(three_probs > .5, "1", "0"), levels = c("0", "1")),
         full_preds = factor(ifelse(full_probs > .5, "1", "0"), levels = c("0", "1")))

confusionMatrix(model_predictions$three_preds, fed_test$Tightening)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 437 231
##          1  38   0
##                                          
##                Accuracy : 0.619          
##                  95% CI : (0.582, 0.6549)
##     No Information Rate : 0.6728         
##     P-Value [Acc > NIR] : 0.9989         
##                                          
##                   Kappa : -0.1019        
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9200         
##             Specificity : 0.0000         
##          Pos Pred Value : 0.6542         
##          Neg Pred Value : 0.0000         
##              Prevalence : 0.6728         
##          Detection Rate : 0.6190         
##    Detection Prevalence : 0.9462         
##       Balanced Accuracy : 0.4600         
##                                          
##        'Positive' Class : 0              
## 
confusionMatrix(model_predictions$full_preds, fed_test$Tightening)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 437 169
##          1  38  62
##                                           
##                Accuracy : 0.7068          
##                  95% CI : (0.6717, 0.7402)
##     No Information Rate : 0.6728          
##     P-Value [Acc > NIR] : 0.02886         
##                                           
##                   Kappa : 0.2205          
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9200          
##             Specificity : 0.2684          
##          Pos Pred Value : 0.7211          
##          Neg Pred Value : 0.6200          
##              Prevalence : 0.6728          
##          Detection Rate : 0.6190          
##    Detection Prevalence : 0.8584          
##       Balanced Accuracy : 0.5942          
##                                           
##        'Positive' Class : 0               
## 
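As a check on caret’s arithmetic, balanced accuracy is just the mean of sensitivity and specificity. Using the cell counts from the full-model confusion matrix above (positive class = 0):

```r
# Cell counts copied from the full-model confusion matrix
sens <- 437 / (437 + 38)    # true 0s predicted 0, out of all true 0s
spec <- 62 / (62 + 169)     # true 1s predicted 1, out of all true 1s
(sens + spec) / 2           # balanced accuracy, ~0.594 as reported
```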

Calculate and plot log-odds and probabilities. Compare these probabilities with fitted values

I’ve put together a faceted plot to review the logistic log odds and probabilities at once. While the log odds are on a different scale, they have the same basic shape as the two probability panels. The exponentiated log odds, labeled here as probability, are identical to the fitted values, so both lines match. This is because the probabilities are derived from the model’s predictions, gathered using predict, which returns the untransformed log odds by default; adding type = "response" to the predict call returns the converted probabilities instead. As such, showing both is somewhat redundant, but it does shed light on how R performs these conversions, which is always useful.

logistic_odds <- data.frame(
  odds = "log.odds",
  values = predict(logistic_full))

logistic_odds <- data.frame(
  odds = "probability",
  values = 1 / (exp(-logistic_odds$values) + 1)) %>%
  bind_rows(logistic_odds)

logistic_odds <- data.frame(
  odds = "fitted.values",
  values = logistic_full$fitted.values) %>%
  bind_rows(logistic_odds)

logistic_odds %>%
  mutate(Index = rep(1:nrow(federal_cycles), 3),
         odds = factor(odds, levels = c("log.odds", "probability", "fitted.values"))) %>%
  ggplot(aes(Index, values, colour = odds)) +
  geom_line(size = 1.3) +
  facet_wrap(facets = "odds", scales = "free_y", ncol = 1) +
  scale_colour_manual(values = c("royalblue2", "darkorange", "#EE6AA7")) +
  theme(legend.position = "none") +
    labs(title = "Log odds and probabilities for logistic regression model with all predictors",
         subtitle = "Both probability (orange line) and fitted values (pink line) are identical despite being derived differently (exp log odds vs glm fitted values)",
         y = "Fitted Values & Log-Odds",
         x = "Samples",
       caption = "Source: Course Project Treasury Data")
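As a side note, base R’s plogis() performs exactly this inverse-logit conversion, so the manual 1 / (exp(-x) + 1) step could be replaced. A quick self-contained check:

```r
# plogis() is the logistic CDF, i.e. the inverse-logit transformation
log_odds <- c(-2, 0, 1.5)
manual   <- 1 / (exp(-log_odds) + 1)
all.equal(manual, plogis(log_odds))  # TRUE
```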


Part 5: Full and combination-based linear regression

This section focuses on developing linear models with more than one predictor, including one with all bond categories. I’ll work to find the most suitable linear model by looking at different combinations of predictors as well.

Estimate the full model by using all 7 predictors and provide interpretation. How good is the fit? How significant are the parameters?

As with the logistic regression, I’ve put together a data frame with only the variables needed for modelling.

bond_regression <- treasury_data %>%
  select(-Date, -Easing, -Tightening)

To start, I’ve put together a full model with all the predictors. This call differs slightly from the all-predictor logistic model since I’ve opted to use Output1 ~ . to signify all predictors instead of writing out their individual names. The output is the same either way.

linear_full <- lm(Output1 ~ ., data = bond_regression)

I’ve collected the full model coefficients below. The estimates are probably the least interesting part of this table. They show that the full model has an intercept around -14.9 and that each predictor contributes between roughly .3 and .42 to output changes. Taken together, the slopes sum to about 2.64, which is close to the unit change from one of the initial simple linear models. While this is a rough way of thinking about each contribution, given how highly correlated the predictors are, I thought it was worth including.

What’s far more interesting here are the standard errors and t-values. Each of the inputs has a standard error of zero. A coefficient’s standard error captures the uncertainty in its estimate; smaller values indicate a more precise fit. Here, there simply is no uncertainty at all. The resulting significance tests, which divide the estimates by these standard errors, produce massive t-values and extremely significant results. However, when a model seems too good to be true, there’s cause for thinking something is amiss.

summary(linear_full)$coefficients %>%
  custom_kable()
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.9041591 0 -141024294891 0
USGG3M 0.3839609 0 3893968285 0
USGG6M 0.3901870 0 2601053702 0
USGG2YR 0.4151851 0 1615851177 0
USGG3YR 0.4063541 0 1231735395 0
USGG5YR 0.3860610 0 1474449865 0
USGG10YR 0.3477544 0 1241860763 0
USGG30YR 0.3047124 0 1945195584 0

Building on this, the full model variance explanations (R2 and adjusted R2) confirm what I’ve alluded to throughout the analysis, namely that Output1 is some combination of the input variables. The final telltale sign is perfect variance explanation. Values of 1 for R2 and adjusted R2 indicate no variance is left unexplained. While this might seem ideal, it has obvious negative implications for any modelling task. With a perfect fit, there really isn’t a model in the traditional sense. Since the output is an exact combination of the inputs, there’s no error to model and the regression simply describes the sample data; the predictors are the output, and the output is comprised of the predictors. With this in mind, the full regression model is not the best choice here, given it merely describes data, negating the need for a model.

unlist(summary(linear_full)[8:9]) %>% 
  custom_kable()
r.squared 1
adj.r.squared 1
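The perfect fit can be reproduced on toy data: when the outcome is an exact linear combination of the inputs, lm reports an R2 of 1 (a self-contained sketch with simulated values, not the project set):

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 2 + 0.5 * x1 + 0.25 * x2   # exact combination, no noise term

summary(lm(y ~ x1 + x2))$r.squared  # 1 (up to floating point): nothing left to explain
```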

The degrees of freedom here refer to the number of estimated coefficients (df1 = 8: seven slopes plus the intercept), while the larger number, df2, is the residual degrees of freedom (8300 samples minus 8 coefficients = 8292, i.e. n - k - 1 with k = 7 predictors). These feed into the calculations for R2 and adjusted R2.

unlist(summary(linear_full)[7]) %>% 
  custom_kable()
df1 8
df2 8292
df3 8

As always, it’s good to check the full model residuals. The pattern here is unexpected for a model with an R2 of 1. I thought all the points would be perfectly aligned on the line straddling zero, but they form a distinct “H” shape. Looking at the scale, these residuals are all essentially zero, which is in line with my expectations. However, points with fitted values below -10 and above 10 deviate into unique clusters. This would seem to indicate that at higher and lower output ranges, the fitted values are slightly less than perfect. Getting ahead a bit, since the output appears to be some combination, it’s possible it is less representative at these more extreme values (although this is a guess).

autoplot(linear_full, which = 1) +
  labs(title = "Fitted vs residuals for full linear model")

Estimate the Null model by including only intercept and explore the model. Why does the summary not show an R2?

I’ve put together a null model here, which only includes an intercept. To construct this model in R, the lm call includes only the output variable and a 1 for input, which denotes the intercept.

linear_null <- lm(Output1 ~ 1, data = bond_regression)

The null model shows what the regression would look like without any predictors. It essentially acts as the null hypothesis since no effects are included; the intercept simply equals the mean of the output. Here the estimate is essentially zero, which suggests the output variable is centered around zero. With an estimate of zero, the t-value is also zero and the p-value is 1, so the test offers no evidence against the null.

summary(linear_null)$coefficients %>% 
  custom_kable()
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0 0.0961954 0 1
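The intercept-only estimate is just the sample mean of the outcome, which can be confirmed on toy data (hypothetical values, not the project set):

```r
# An intercept-only regression recovers the sample mean
set.seed(7)
y <- rnorm(50, mean = 3)
fit_null <- lm(y ~ 1)
all.equal(unname(coef(fit_null)), mean(y))  # TRUE
```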

The previous explanation further situates why no variance is being explained. To have any nonzero R2 or adjusted R2, there have to be predictors included that might explain some of the output’s variance. Without any predictors, nothing can explain variance, so by definition the null model must have an R2 and adjusted R2 of zero.

unlist(summary(linear_null)[8:9]) %>% 
  custom_kable()
adj.r.squared 0
r.squared 0

Conduct anova test using full and null model

The anova highlights that there is a significant difference between the null and full models, which suggests a relationship between the input variables and the output. In essence, the null hypothesis can be rejected in favour of the alternative, namely that at least one slope differs from zero. This isn’t surprising given all the work leading up to the test. The analysis of variance evaluates how much variance is attributable to the model versus the null, so the result is in line with the other findings.

anova(linear_full, linear_null)
## Analysis of Variance Table
## 
## Model 1: Output1 ~ USGG3M + USGG6M + USGG2YR + USGG3YR + USGG5YR + USGG10YR + 
##     USGG30YR
## Model 2: Output1 ~ 1
##   Res.Df    RSS Df Sum of Sq        F    Pr(>F)    
## 1   8292      0                                    
## 2   8299 637400 -7   -637400 3.73e+22 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
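The F statistic in a nested-model anova can be reproduced by hand from the two residual sums of squares. A minimal sketch on synthetic data (not the project set):

```r
# Minimal sketch (synthetic data): the nested-model F statistic is
# ((RSS_null - RSS_full) / df_diff) / (RSS_full / df_resid_full).
set.seed(1)
x <- 1:50
y <- 2 * x + rnorm(50)
full <- lm(y ~ x)
null <- lm(y ~ 1)
rss_full <- sum(resid(full)^2)
rss_null <- sum(resid(null)^2)
f_manual <- ((rss_null - rss_full) / 1) / (rss_full / full$df.residual)
f_anova <- anova(full, null)$F[2]
stopifnot(isTRUE(all.equal(f_manual, f_anova)))
```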

Repeat the analysis for different combinations of input variables and select the one you think is the best- explain your selection

I’ve already indicated why I don’t think the full model is preferable here but, to reiterate: since it has a perfect fit, it doesn’t do much beyond describing the data. There needs to be some variance left unexplained for a model to have wider utility; if the fit is perfect, there is little need for a model at all.

What constitutes the best model here is tricky though. I’ve established why the full model isn’t useful from a statistical perspective but, I also think the analysis clearly points to the output being a combination of the inputs. With that in mind, any smaller model is essentially uncovering how much each variable contributes to that combination of all the predictors. The utility might be in figuring out which predictors, by themselves or in some non-full combination, explain the most of Output1, thereby showing which is weighted most heavily in the combination. In any case, there needs to be an R2 less than 1.

All considered, I think the best model is one that strikes a balance between high R2 and interpretability. In essence, I think the most preferable model is parsimonious, one which includes the fewest inputs with the most variance explained. Much like the logistic regression, these models are used for broad understanding so there should be a strong preference towards a model that can be explained and easily interpreted. Following these criteria, the model for USGG3YR is my selection for being the best, or more realistically, the most appropriate for the given situation. Viewing the anova test, the model is still very significant and, as the simple linear section showed, it outperforms other bond categories for variance explained.

anova(linear3YR, linear_null)
## Analysis of Variance Table
## 
## Model 1: Output1 ~ USGG3YR
## Model 2: Output1 ~ 1
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1   8298   1325                                   
## 2   8299 637400 -1   -636075 3984011 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since I haven’t yet shown any combinations, I’ll use a comparison to highlight why I prefer the single-term model. The chunk below develops a model including USGG3M, USGG6M, and USGG30YR. As the anova shows, it’s the preferred model and adds more explanatory power when compared to USGG3YR.

linear_mixed <- lm(Output1 ~ USGG3M + USGG6M + USGG30YR, data = bond_regression)

anova(linear_mixed, linear3YR)
## Analysis of Variance Table
## 
## Model 1: Output1 ~ USGG3M + USGG6M + USGG30YR
## Model 2: Output1 ~ USGG3YR
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   8296  762.53                                  
## 2   8298 1324.83 -2   -562.31 3058.8 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, the model summary provides insight into why I prefer the USGG3YR model. The same multi-collinearity seen in the logistic section plagues model interpretation here as well. USGG3M shows a negative relationship with the output while the other two remain positive. Again, this doesn’t conform to previous work and by this point, my intuition flags it as suspect as well. Additionally, this model only adds a minor amount to R2 when compared to the USGG3YR model (.9979 vs .9988).
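The sign flip is a textbook collinearity artifact, and it is easy to reproduce. A minimal sketch on synthetic data (not the project set): a predictor can correlate positively with the outcome on its own yet receive a negative coefficient in a joint fit alongside a near-duplicate predictor.

```r
# Minimal sketch (synthetic data): collinearity can flip a coefficient's
# sign relative to the marginal association.
set.seed(11)
a <- rnorm(200)
b <- a + rnorm(200, sd = 0.2)  # nearly collinear with a
y <- 2 * a - b                 # joint effects: +2 for a, -1 for b
fit <- lm(y ~ a + b)
stopifnot(cor(b, y) > 0)               # b is positively related to y on its own
stopifnot(unname(coef(fit)["b"]) < 0)  # yet its joint coefficient is negative
```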

This goes back to selecting a model based on the overall analysis goal. Since it’s fairly clear the output is a derivation of the inputs, the model shouldn’t, or plainly wouldn’t, be used for prediction. This leaves the model as useful for understanding how specific inputs are related to the output, which favours simple, interpretable results. For this, I think the USGG3YR is the best because it provides insight into the relationship between one bond category and the output while having the highest R2 and clear coefficients. As a final point, realistically any of the single term models does a reasonable job highlighting the input and output association but, the USGG3YR also seems to explain the most variance so it’s my choice.

summary(linear_mixed)
## 
## Call:
## lm(formula = Output1 ~ USGG3M + USGG6M + USGG30YR, data = bond_regression)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02535 -0.22005 -0.00836  0.20314  1.26245 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -16.16756    0.01165 -1387.28   <2e-16 ***
## USGG3M       -0.52808    0.01587   -33.28   <2e-16 ***
## USGG6M        2.13856    0.01656   129.14   <2e-16 ***
## USGG30YR      1.20484    0.00317   380.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3032 on 8296 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9988 
## F-statistic: 2.309e+06 on 3 and 8296 DF,  p-value: < 2.2e-16

Part 6: Rolling analysis, volatility, and pairwise analysis

Perform rolling window analysis of the yields data

Essentially, the rollapply function, with specific parameters set for width and by, takes the rolling mean of each bond category. width sets how many samples each window contains (20) while by shifts the start of each successive window (in this case, by five time points). The final dimensions of the rolling means object show how by works, since it has 1657 rows, or roughly 8300 divided by 5.

window_width <- 20

window_shift <- 5

rolling_means <- rollapply(bond_regression,
                           width = window_width,
                           by = window_shift,
                           by.column = TRUE,
                           mean)

head(rolling_means) %>%
  custom_kable()
USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR Output1
15.0405 14.0855 13.2795 12.9360 12.7825 12.5780 12.1515 20.14842
15.1865 14.1440 13.4855 13.1085 12.9310 12.7370 12.3370 20.55208
15.2480 14.2755 13.7395 13.3390 13.1500 12.9480 12.5500 21.04895
14.9345 14.0780 13.7750 13.4765 13.2385 13.0515 12.6610 21.02611
14.7545 14.0585 13.9625 13.6890 13.4600 13.2295 12.8335 21.31356
14.6025 14.0115 14.0380 13.7790 13.5705 13.3050 12.8890 21.39061
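The window count can be checked with quick arithmetic: window k starts at row 1 + (k - 1) * by, so the number of windows is floor((n - width) / by) + 1.

```r
# Window-count arithmetic for rollapply(width = 20, by = 5) on 8300 rows.
n <- 8300; w <- 20; b <- 5
starts <- seq(1, n - w + 1, by = b)  # start row of each window
stopifnot(length(starts) == floor((n - w) / b) + 1)
stopifnot(length(starts) == 1657)    # matches the rolling means object
```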

It’s fairly straightforward to recreate how this works on a small scale. Basically, the function takes a mean for rows 1 to 20, then shifts the start forward five rows and takes another for rows 6 to 25. This process is repeated across the entire data set for all columns (given by.column is set to TRUE).

roll_mean <- bond_regression %>%
  slice(6:25) %>%
  summarise(USGG3M_roll = mean(USGG3M),
            USGG6M_roll = mean(USGG6M),
            USGG2YR_roll = mean(USGG2YR),
            USGG3YR_roll = mean(USGG3YR),
            USGG5YR_roll = mean(USGG5YR),
            USGG10YR_roll = mean(USGG10YR),
            USGG30YR_roll = mean(USGG30YR))

bond_regression %>%
  slice(1:20) %>%
  summarise(USGG3M_roll = mean(USGG3M),
            USGG6M_roll = mean(USGG6M),
            USGG2YR_roll = mean(USGG2YR),
            USGG3YR_roll = mean(USGG3YR),
            USGG5YR_roll = mean(USGG5YR),
            USGG10YR_roll = mean(USGG10YR),
            USGG30YR_roll = mean(USGG30YR)) %>%
  bind_rows(roll_mean) %>%
  custom_kable()
USGG3M_roll USGG6M_roll USGG2YR_roll USGG3YR_roll USGG5YR_roll USGG10YR_roll USGG30YR_roll
15.0405 14.0855 13.2795 12.9360 12.7825 12.578 12.1515
15.1865 14.1440 13.4855 13.1085 12.9310 12.737 12.3370

The next chunk highlights the windows I described. Starting at 1 and moving across the first row, the final column value is 20. The next row is the 6 to 25 window, which was also used to calculate the mean above.

window_count <- seq(bond_regression$USGG3M)

rolling_window <- rollapply(window_count,
                            width = window_width,
                            by = window_shift,
                            by.column = FALSE,
                            FUN = function(z) z)

head(rolling_window, 10) %>%
  custom_kable()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65

This object helps set up filtering so when these values need to be plotted, they can be placed into the right spot with their corresponding USGG3M.

calculation_points <- rolling_window[,10]

c(calculation_length = length(calculation_points)) %>%
  custom_kable()
calculation_length 1657

My code diverges from the provided example here. The original code works towards setting up a matrix with USGG3M in one column and the rolling mean calculation in another. Using this, the variables are visualized. I’ve done the same thing but, used a data frame so I can utilize ggplot2 instead.

plot_means <- bond_regression %>%
  select(USGG3M) %>%
  mutate(mean_calculation = 0)
  
plot_means$mean_calculation[calculation_points] <- rolling_means[,1]

head(plot_means, 20) %>%
  custom_kable()
USGG3M mean_calculation
13.52 0.0000
13.58 0.0000
14.50 0.0000
14.76 0.0000
15.20 0.0000
15.22 0.0000
15.24 0.0000
15.08 0.0000
15.25 0.0000
15.15 15.0405
15.79 0.0000
15.32 0.0000
15.71 0.0000
15.72 0.0000
15.70 15.1865
15.35 0.0000
15.20 0.0000
15.06 0.0000
14.87 0.0000
14.59 15.2480

However, I still get the same plot, albeit formatted differently. As seen, the rolling mean points follow the general trend line of USGG3M.

plot_means %>%
  mutate(index = seq(plot_means$mean_calculation),
         mean_marker = ifelse(mean_calculation != 0, T, F)) %>%
  ggplot(aes(x = index, colour = mean_marker)) +
  geom_point(aes(y = mean_calculation, alpha = mean_marker), size = .75) +
  geom_line(aes(y = USGG3M), colour = "darkgray", size = 2, alpha = .4) +
  scale_alpha_discrete(range = c(0, 1), guide = FALSE) +
  scale_colour_manual(values = c("white", "darkorange"),
                      name = "Rolling Calculation",
                      breaks = "TRUE",
                      labels = "mean") +
  labs(title = "USGG3M bond yields (gray line) and rolling mean values (orange dots)",
       y = "USGG3M (Value and Rolling Mean)",
       x = "Index",
       caption = "Source: Course Project Treasury Data")

Run the rolling daily difference for the standard deviation of each variable

Executing this requires making a matrix containing the daily differences of each variable before a rollapply for the standard deviation can be applied. To do this, I’ve used apply with the diff function to take the difference of each column (the 2 marks the column margin). Thereafter, I’ve added the dates back in using rownames so the samples can be verified.

sd_diff <- apply(bond_regression, 2, diff)

rownames(sd_diff) <- treasury_data$Date[-1]

head(sd_diff) %>%
  custom_kable()
USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR Output1
4023 0.06 0.07 0.14 0.03 -0.08 -0.04 0.00 0.0758722
4024 0.92 0.74 0.50 0.47 0.40 0.27 0.22 1.3559162
4025 0.26 0.10 0.17 0.17 0.07 -0.03 0.02 0.3011961
4026 0.44 0.30 0.44 0.33 0.20 0.22 0.22 0.8235321
4029 0.02 -0.07 -0.36 -0.34 -0.17 -0.12 -0.05 -0.4298574
4030 0.02 -0.13 0.13 0.03 -0.03 0.08 0.00 0.0393581
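The column-wise differencing can be sketched on a small matrix; note the result has one fewer row than the input, which is why the dates are taken from Date[-1].

```r
# Minimal sketch: apply(x, 2, diff) takes first differences down each
# column, dropping the first row in the process.
m <- matrix(c(1, 3, 6, 10,
              2, 5, 9, 14), ncol = 2)
d <- apply(m, 2, diff)
stopifnot(nrow(d) == nrow(m) - 1)
stopifnot(all(d[, 1] == c(2, 3, 4)), all(d[, 2] == c(3, 4, 5)))
```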

From there, the same rollapply method used for means can be done for standard deviation. I’ve also added the date variable back into the bond regression set as it’s needed for visualizations.

Before moving on, I’ll address what the rolling standard deviation might be useful for. Standard deviation here essentially comprises a volatility rating. Since the rates are reasonably homogeneous in short date windows, any increased standard deviation marks rate fluctuations. With this in mind, it will be useful for picking out adverse financial events. Major events to watch for include the 2008 recession, which acts as an intuitive benchmark to gauge how well this rolling standard deviation captures volatility.

bond_regression <- treasury_data %>%
  select(Date) %>%
  bind_cols(bond_regression)

rolling_sd <- rollapply(sd_diff,
                           width = window_width,
                           by = window_shift,
                           by.column = TRUE,
                           sd)

head(rolling_sd) %>%
  custom_kable()
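The rolling standard deviation call can also be mirrored in base R. A minimal sketch (synthetic data, hypothetical roll_sd helper) showing what rollapply computes on a single column:

```r
# Minimal sketch: a base-R rolling standard deviation matching
# rollapply(width = 20, by = 5, FUN = sd) on one column.
roll_sd <- function(x, width, by) {
  starts <- seq(1, length(x) - width + 1, by = by)
  vapply(starts, function(s) sd(x[s:(s + width - 1)]), numeric(1))
}
set.seed(5)
x <- rnorm(100)
out <- roll_sd(x, width = 20, by = 5)
stopifnot(length(out) == 17)                       # floor((100 - 20) / 5) + 1
stopifnot(isTRUE(all.equal(out[1], sd(x[1:20]))))  # first window covers rows 1..20
```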

I’m collecting the dates that correspond to the rolling standard deviation measurements so they can be merged with the calculated values and analyzed. In the previous chunk I reintroduced Date into the data frame and here, I’m using rollapply with a customized paste function to collect the samples. This yields a large character matrix which captures the rolling window dates.

rolling_dates <- rollapply(bond_regression[-1,1],
                           width = window_width,
                           by = window_shift,
                           by.column = FALSE,
                           function(x) paste0(x))

head(rolling_dates) %>%
  custom_kable()
1981-01-06 1981-01-07 1981-01-08 1981-01-09 1981-01-12 1981-01-13 1981-01-14 1981-01-15 1981-01-16 1981-01-19 1981-01-20 1981-01-21 1981-01-22 1981-01-23 1981-01-26 1981-01-27 1981-01-28 1981-01-29 1981-01-30 1981-02-02
1981-01-13 1981-01-14 1981-01-15 1981-01-16 1981-01-19 1981-01-20 1981-01-21 1981-01-22 1981-01-23 1981-01-26 1981-01-27 1981-01-28 1981-01-29 1981-01-30 1981-02-02 1981-02-03 1981-02-04 1981-02-05 1981-02-06 1981-02-09
1981-01-20 1981-01-21 1981-01-22 1981-01-23 1981-01-26 1981-01-27 1981-01-28 1981-01-29 1981-01-30 1981-02-02 1981-02-03 1981-02-04 1981-02-05 1981-02-06 1981-02-09 1981-02-10 1981-02-11 1981-02-13 1981-02-17 1981-02-18
1981-01-27 1981-01-28 1981-01-29 1981-01-30 1981-02-02 1981-02-03 1981-02-04 1981-02-05 1981-02-06 1981-02-09 1981-02-10 1981-02-11 1981-02-13 1981-02-17 1981-02-18 1981-02-19 1981-02-20 1981-02-23 1981-02-24 1981-02-25
1981-02-03 1981-02-04 1981-02-05 1981-02-06 1981-02-09 1981-02-10 1981-02-11 1981-02-13 1981-02-17 1981-02-18 1981-02-19 1981-02-20 1981-02-23 1981-02-24 1981-02-25 1981-02-26 1981-02-27 1981-03-02 1981-03-03 1981-03-04
1981-02-10 1981-02-11 1981-02-13 1981-02-17 1981-02-18 1981-02-19 1981-02-20 1981-02-23 1981-02-24 1981-02-25 1981-02-26 1981-02-27 1981-03-02 1981-03-03 1981-03-04 1981-03-05 1981-03-06 1981-03-09 1981-03-10 1981-03-11

Finally, I’ve unified all the composite pieces in a single data frame with all the required rolling standard deviation calculations for USGG3M, USGG5YR, USGG30YR, and Output1. As before, the date for each sample is taken from the middle of its window (hence the 10th column).

rownames(rolling_sd) <- rolling_dates[,10]

plot_sd <- as.data.frame(rolling_sd) %>%
  tibble::rownames_to_column(var = "Date") %>%
  select(Date, USGG3M, USGG5YR, USGG30YR, Output1)

head(plot_sd) %>%
  custom_kable()
Date USGG3M USGG5YR USGG30YR Output1
1981-01-19 0.3433258 0.1713192 0.1202147 0.5639875
1981-01-26 0.2933383 0.1450082 0.1192201 0.4707427
1981-02-02 0.2613180 0.1654110 0.1351909 0.4681168
1981-02-09 0.2551754 0.1717219 0.1422183 0.4786189
1981-02-18 0.2480551 0.1744767 0.1516540 0.4888569
1981-02-25 0.1963884 0.1822917 0.1537351 0.4788897

Create a pairwise plot and interpret the findings

The plot below displays the pairwise coefficients from the previous linear regressions. The coefficients for each bond category represent the unit association with Output1 during the high volatility days. The model intercept, where Output1 falls without any variable interaction, can be compared to each category coefficient.

Across the top row, there is a correlation coefficient for each bond category in relation to the intercept. Not surprisingly, two are negative associations and one is positive. I expected a mix of signs owing to collinearity, which has been reviewed in previous sections and once again appears here. The negative correlation for USGG3M is just in the weak category while the USGG30YR crosses into medium strength. The USGG5YR is weakly positive but, the interpretation suffers due to collinearity. Overall, there seems to be some evidence for a negative correlation between high volatility days and downward pressure on rates. The above correlations are also captured in the first column where each slope coefficient is plotted against the corresponding intercept value for high volatility days. I’ve included regression lines to better situate the associations.

The other scatterplots show the interaction between the various bond category slope coefficients. The strongest relationship is between USGG5YR and USGG30YR, which has a strong, negative correlation (-.833). This captures how both slope coefficients move strongly down during high volatility periods. The remaining two pairs plots are less significant given they show very weak correlations. Taking the macro view though, I think intuitively these slope coefficients should move downwards during high volatility, which would be more evident if each model was done individually to avoid collinearity. That said, outside of intuition, there is still evidence here that there is a negative association between rates and high volatility periods.

coefficients <- as.data.frame(coefficients)

regression_pairs <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) + 
    geom_point() +
    geom_smooth(method = lm, color = "blue", ...)
  p
}

coefficients %>%
  ggpairs(title = "Pairwise plot for USGG3M, USGG5YR, and USGG30YR",
          lower = list(continuous = regression_pairs))

Plot the coefficients. Is the picture of coefficients consistent with the picture of pairs? If yes, explain why.

To assess this question, it’s important to keep the correlations in mind from the previous plot. I’ve developed a specific correlation plot here to re-emphasize how the respective bond category coefficients are correlated. At a high level, the takeaway here is USGG3M has weak correlation with both the other categories while USGG5YR and USGG30YR have a strong negative correlation.

corplot_df <- cor(coefficients[,-1])

corrplot.mixed(corplot_df,
               title = "High volatility linear model correlations", 
               mar = c(0, 0, 1, 0))

Continuing the pairs comparison, it’s clear here that the USGG3M (red line) has a different shape than the other two lines. This is punctuated starting around the financial crisis when the category has erratic coefficients. In other periods, like the early 2000s, the category is also noticeably out of sync with the other two lines. These relationships were relatively evident in both the bivariate scatterplot and correlation plot but, they also appear well enough here. Intuitively, this makes sense given the underlying data hasn’t changed, only the visualization medium has.

volatility_lm <- melt(coefficients,
             variable.name = "bond_category",
             value.name = "coefficients")

volatility_lm %>%
  filter(bond_category %in% c("USGG3M", "USGG5YR", "USGG30YR")) %>%
  mutate(Date = rep(as.Date(rolling_dates[,10]), 3)) %>%
  ggplot(aes(Date, coefficients, colour = bond_category)) +
  geom_line(size = 1.3, alpha = .2) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
      labs(title = "Linear coefficients for USGG3M, USGG5YR, and USGG30YR during high bond volatility periods",
       y = "Coefficients",
       x = "Date",
       caption = "Source: Course Project Treasury Data")

Getting a clearer, untangled view of the lines helps here as well. The USGG5YR and USGG30YR look broadly similar while the USGG3M is noticeably different in parts. As such, these visualizations are consistent with the findings from the pairs plot. That said, I think this is a worse format simply because a direct bivariate comparison cannot be made. While the univariate trend lines illuminate each bond category over time more effectively, they are not as well suited for comparative analysis.

volatility_lm %>%
  filter(bond_category %in% c("USGG3M", "USGG5YR", "USGG30YR")) %>%
  mutate(Date = rep(as.Date(rolling_dates[,10]), 3)) %>%
  ggplot(aes(Date, coefficients, colour = bond_category)) +
  geom_line(size = 1.3, alpha = .6, show.legend = F) +
  facet_wrap(facets = "bond_category", ncol = 1) +
  geom_hline(yintercept = 0, colour = "mediumorchid3", alpha = .8, size = 1.3) +
      labs(title = "Linear coefficients for USGG3M, USGG5YR, and USGG30YR during high bond volatility periods",
       y = "Coefficients",
       x = "Date",
       caption = "Source: Course Project Treasury Data")

As per the analysis, I’ve included the high slope spreads and jump slopes. The first, high slope spreads, highlights major differences (greater than 3) between the five- and thirty-year bond slope coefficients. There are 26 such instances in the coefficients set.

coefficients %>%
  mutate(Date = as.Date(rolling_dates[,10]),
         high_slope_spread = USGG5YR - USGG30YR) %>%
  filter(high_slope_spread > 3) %>%
  select(Date, high_slope_spread) %>%
  custom_kable()
Date high_slope_spread
1982-04-26 3.253311
1982-12-15 3.026368
1985-09-16 3.097777
1987-05-12 3.026688
1987-05-19 3.542911
1987-05-27 3.450011
1987-09-25 3.103604
1988-03-15 3.065761
1988-09-27 3.112827
1988-10-05 3.122279
1989-03-10 3.701466
1992-02-05 3.216466
1994-08-03 3.029625
1994-12-08 3.602118
1996-06-14 3.376506
1997-05-09 3.054137
1997-05-16 3.063906
1997-05-23 4.009702
1997-05-30 3.197572
2000-12-26 3.768436
2001-01-02 3.293521
2001-07-25 4.175424
2001-08-01 4.308487
2003-11-13 3.244270
2004-08-12 3.176460
2004-12-16 4.199836

The jump slope highlights only one instance where USGG5YR was above 3 for the slope coefficient.

coefficients %>%
  mutate(Date = as.Date(rolling_dates[,10])) %>%
  filter(USGG5YR > 3) %>%
  select(Date, USGG5YR) %>%
  custom_kable()
Date USGG5YR
2004-12-16 3.195845

Plot the rolling R2 values. What could cause a decrease of R2?

As a next step, I’ve put together the rolling R2 values for the high volatility periods. It’s fairly standard work at this point with the change that rollapply captures the R2 here. Since I already have the rolling dates object, it’s easy to put them into a new data frame with the R2 values.

r_squared <- rollapply(bond_regression[,-1],
                       width = window_width,
                       by = window_shift,
                       by.column = FALSE,
                       FUN = function(z) summary(lm(Output1 ~ USGG3M + USGG5YR + USGG30YR, data = as.data.frame(z)))$r.squared)

r_squared <- as.data.frame(r_squared) %>%
  mutate(Date = as.Date(rolling_dates[,10])) %>%
  select(Date, r_squared)

head(r_squared, 10) %>%
  custom_kable()
Date r_squared
1981-01-16 0.9950463
1981-01-23 0.9924859
1981-01-30 0.9986412
1981-02-06 0.9988491
1981-02-17 0.9979588
1981-02-24 0.9964898
1981-03-03 0.9977975
1981-03-10 0.9989634
1981-03-17 0.9987294
1981-03-24 0.9970730

The plot highlights that the R2 values are extremely high throughout but, with some notable exceptions. While the R2 mean rests around .99, on several windows it goes as low as about .82. Why might this be happening?

r_squared %>%
  ggplot(aes(Date, r_squared)) +
  geom_jitter(alpha = .2, colour = "dodgerblue2") +
  geom_hline(yintercept = mean(r_squared$r_squared), 
             colour = "mediumorchid3", alpha = .8, size = 1.3) +
  scale_y_continuous(breaks = seq(.8, 1, .02)) +
  labs(title = "R squared for model including USGG3M, USGG5YR, and USGG30YR during high bond volatility periods",
       subtitle = "Mean R2 value (purple line) shows very high variance explanation by model",
       y = "R Squared",
       x = "Date",
       caption = "Source: Course Project Treasury Data")

To start, a decrease in R2 signals that the model is not explaining as much of the total variance in the output. This would occur at times when the otherwise very strong correlation between the fitted and observed values weakens (R2 being the square of that correlation). Below, the dates when R2 slips below .9 are highlighted.
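The R2-as-squared-correlation claim can be verified directly: for any least-squares fit with an intercept, R2 equals the squared correlation between the fitted and observed values. A minimal sketch on synthetic data (not the project set):

```r
# Minimal sketch (synthetic data): R-squared is the squared correlation
# between the fitted values and the observed outcome.
set.seed(42)
x1 <- rnorm(100); x2 <- rnorm(100)
y <- x1 + 0.5 * x2 + rnorm(100)
fit <- lm(y ~ x1 + x2)
stopifnot(isTRUE(all.equal(summary(fit)$r.squared, cor(fitted(fit), y)^2)))
```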

r_squared %>%
  filter(r_squared < .9) %>%
  custom_kable()
Date r_squared
1987-06-24 0.8227914
1991-06-27 0.8812476
2005-04-28 0.8801402
2012-06-20 0.8988880

Volatility is the first thing that comes to mind as a cause for decreased R2. If window volatility caused the output to fluctuate, but not in complete concert with the predictors, R2 could reasonably drop. However, this doesn’t really come across in the table below. I’m obviously using a rough eye test of the volatility here but, no major differences pop out.

plot_sd %>%
  melt(id.vars = "Date",
       variable.name = "bond_category",
       value.name = "rolling_sd") %>%
  filter(Date %in% c("1987-06-25", "1991-06-28", "2005-04-29", "2012-06-21")) %>%
  custom_kable()
Date bond_category rolling_sd
1987-06-25 USGG3M 0.0579687
1991-06-28 USGG3M 0.0272636
2005-04-29 USGG3M 0.0425541
2012-06-21 USGG3M 0.0070446
1987-06-25 USGG5YR 0.0659105
1991-06-28 USGG5YR 0.0387902
2005-04-29 USGG5YR 0.0655362
2012-06-21 USGG5YR 0.0351445
1987-06-25 USGG30YR 0.0633173
1991-06-28 USGG30YR 0.0357539
2005-04-29 USGG30YR 0.0471661
2012-06-21 USGG30YR 0.0498544
1987-06-25 Output1 0.1277012
1991-06-28 Output1 0.0880148
2005-04-29 Output1 0.1242751
2012-06-21 Output1 0.0574567

Nor is a major volatility difference evident in the plot below, which compares the median volatility across the three bond categories against Output1. Again, this is a rough comparison given I’ve combined the bond volatility and it does not consider any covariance or slope but, volatility differences don’t seem to be a huge influence.

as.data.frame(rolling_sd) %>%
  select(USGG3M, USGG5YR, USGG30YR, Output1) %>%
  mutate(sd_apply = apply(rolling_sd[,1:3], 1, median),
         Date = plot_sd$Date[1:1656],
         low_rsquare = ifelse(Date %in% c("1987-06-25", "1991-06-28", 
                                           "2005-04-29", "2012-06-21"),
                              "low R2 day", "outside scope")) %>%
  ggplot(aes(sd_apply, Output1, colour = low_rsquare)) +
  geom_jitter() +
  labs(title = "Volatility comparison for Output1 and median combination of USGG3M, USGG5YR, and USGG30YR",
       subtitle = "Low R2 days don't seem to stick out as major volatilty difference days",
       y = "Output1 rolling sd",
       caption = "Source: Course Project Treasury Data")

I also thought plotting markers highlighting these low R2 days might be useful. I mentioned there might be days when slope changes in the bond categories and the output don’t align, leading to a lower R2. Each red line highlights one of the lowest R2 dates, which do seem to occur on steep changes in output. However, there are steeper changes and more obvious disconnects between bond rates and Output1 elsewhere, so this is inconclusive, though still interesting.

bond_trend <- treasury_data %>%
  select(-Easing, -Tightening) %>%
  melt(value.name = "interest.rate", 
       variable.name = "treasury.yield",
       id.vars = "Date")

bond_trend %>%
  ggplot(aes(Date, interest.rate, colour = treasury.yield)) +
  geom_line(size = 1.3, alpha = .4) +
  scale_y_continuous(breaks = seq(-20, 30, 5)) +
  geom_hline(yintercept = 0, colour = "royalblue2", size = 1.3, alpha = .3) +
  geom_vline(xintercept = as.Date(c("1987-06-25", "1991-06-28", 
                            "2005-04-29", "2012-06-21")),
             colour = "#EE6363", size = 1.3, linetype = "dashed",
             alpha = .75) +
  guides(colour = guide_legend(override.aes = list(size = 2))) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  labs(title = "Assessing low R2 days using time series plot of all bond classes and output",
       subtitle = "Low R2 seem to occur on days with steep changes in output; results are observational though so nothing confirmed",
       caption = "Source: Course Project Treasury Data",
       y = "interest rate (and unknown output values)")

Another idea for the lowered R2 comes from a broader extrapolation about the predictors and Output1. I’ve already shown that with the full model, the R2 is a perfect 1. With this in mind, it seems reasonable to assume the other four bond categories explain the variance missing here. It could be that those four do a better job of explaining the output on these days, and even modest volatility throws off the included three enough to diminish the R2. As such, these could simply be a few stretches where these three happen to explain less than usual. Overall, despite not having a clean causal explanation, there are at least some instructive theories to work with.

Analyze the rolling p-values. Plot the values and provide an interpretation

The rolling p-values are again obtained using rollapply. These p-values are from the slope tests for each bond category and the intercept for Output1. Given this, they signify which variables are significant in the wider three term model or whether the intercept coefficient is significant.

p_values <- rollapply(bond_regression[,-1],
                       width = window_width,
                       by = window_shift,
                       by.column = FALSE,
                       FUN = function(z) summary(lm(Output1 ~ USGG3M + USGG5YR + USGG30YR, data = as.data.frame(z)))$coefficients[,4])

p_values <- as.data.frame(p_values) %>%
  mutate(Date = as.Date(rolling_dates[,10])) %>%
  rename(Intercept = "(Intercept)")

p_values %>%
  select(Date, Intercept, USGG3M, USGG5YR, USGG30YR) %>%
  head(10) %>%
  custom_kable()
Date Intercept USGG3M USGG5YR USGG30YR
1981-01-16 0 0 0.0000002 0.2538852
1981-01-23 0 0 0.0000002 0.0053009
1981-01-30 0 0 0.0000000 0.0000363
1981-02-06 0 0 0.0000000 0.0000222
1981-02-17 0 0 0.0000000 0.0001332
1981-02-24 0 0 0.0000009 0.0047335
1981-03-03 0 0 0.0000034 0.0601047
1981-03-10 0 0 0.0000168 0.3851840
1981-03-17 0 0 0.0000146 0.5025726
1981-03-24 0 0 0.0006413 0.3052705
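The extraction step inside the rolling FUN can be sketched on synthetic data: column 4 of summary()$coefficients holds the two-sided p-values, named for the intercept and each predictor.

```r
# Minimal sketch (synthetic data): pulling the p-value column the same
# way the rolling FUN above does.
set.seed(7)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
p <- summary(lm(y ~ x))$coefficients[, 4]
stopifnot(names(p)[1] == "(Intercept)")
stopifnot(p["x"] < 0.05)  # a strong slope comes out clearly significant
```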

Plotting these values highlights which predictor slope coefficients are significant. Perhaps more telling, it clearly shows which inputs have the most non-significant coefficients. At a glance, it looks like the USGG30YR term has the most non-significant p-values by a large margin. Beyond that, it appears USGG3M has the most significant coefficients. It also looks like the intercept is always significant.

At first glance, it might be surprising to have so many non-significant predictors come up in a model that is almost certainly always significant (judging by the high R2s and the linear model summaries from previous sections). Again though, the effects of multicollinearity likely influence this result.

plot_pvalue <- melt(p_values, 
                    id.vars = "Date", 
                    variable.name = "bond_category",
                    value.name = "rolling_pvalue")

plot_pvalue %>%
  ggplot(aes(Date, rolling_pvalue, colour = bond_category, shape = bond_category)) +
  geom_jitter(size = 4, alpha = .75) + 
  geom_hline(yintercept = .05, colour = "darkorange", alpha = .3, size = 1.5) +
  scale_shape_manual(values = 49:52) +
  guides(alpha = guide_legend(override.aes = list(alpha = 1))) +
  labs(title = "Slope and intercept p-values for model containing Output1 ~ USGG3M, USGG5YR, and USGG30YR",
       subtitle = "USGG30YR appears to have the most non-significant slope coefficients in rolling lm (orange line is .05 p-value threshold)",
       caption = "Source: Course Project Treasury Data")

Confirming the previous intuition stemming from the p-value plot, the table below shows that USGG30YR had 773 p-values over the significance threshold of .05 (about 47% of all samples), which was by far the most of all the predictors.

plot_pvalue %>%
  group_by(bond_category) %>%
  filter(rolling_pvalue > .05) %>%
  count() %>%
  summarise(non_significant_n = n,
            non_significant_percent = round(n / 1657 * 100)) %>%
  arrange(desc(non_significant_n)) %>%
  custom_kable()
bond_category non_significant_n non_significant_percent
USGG30YR 773 47
USGG3M 156 9
USGG5YR 32 2

The following few chunks also highlight which dates these high p-values correspond to, using a .5 threshold. This includes 312 dates in total, which I’ve truncated to five samples from each bond category to save space. All samples can be seen by removing the slice call.

plot_pvalue %>%
  group_by(bond_category) %>%
  filter(rolling_pvalue > .5) %>%
  select(Date, rolling_pvalue) %>%
  arrange(desc(rolling_pvalue)) %>%
  slice(1:5) %>%
  custom_kable()
bond_category Date rolling_pvalue
USGG3M 1992-07-15 0.9728457
USGG3M 2002-05-23 0.9558409
USGG3M 2009-10-13 0.9155319
USGG3M 2000-11-07 0.8972432
USGG3M 2011-11-23 0.8940151
USGG5YR 1987-09-11 0.9764203
USGG5YR 1987-06-24 0.9277380
USGG5YR 1982-12-01 0.9113957
USGG5YR 1987-09-03 0.8060915
USGG5YR 1999-12-03 0.7327210
USGG30YR 1999-06-25 0.9992457
USGG30YR 1994-03-16 0.9985098
USGG30YR 2001-07-11 0.9983685
USGG30YR 2007-11-16 0.9956528
USGG30YR 1996-12-06 0.9920443

Part 7: Principal Component Analysis

The final section of the project deals with Principal Component Analysis (PCA). This statistical method breaks down predictors into new variables that are orthogonal and thereby uncorrelated. Given the analysis has shown how highly correlated the bond categories are, PCA seems ideal for this data set.
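To illustrate that orthogonality property before applying it, here is a hedged sketch on simulated series (not the project data): even when the inputs are almost perfectly correlated, the component scores princomp extracts are mutually uncorrelated.

```r
set.seed(11)

#three almost perfectly correlated simulated series
x <- cumsum(rnorm(300))
d <- data.frame(a = x + rnorm(300, sd = .5),
                b = x + rnorm(300, sd = .5),
                c = x + rnorm(300, sd = .5))

pc <- princomp(d)

#off-diagonal correlations between component scores are zero (up to rounding)
round(cor(pc$scores), 10)
```

This is exactly the property that makes PCA attractive for the bond data, where every pairwise correlation is close to one.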

Perform PCA with all the input bond categories

Beginning the analysis, I’ve created a PCA specific set that only includes the bond category predictors. PCA is not performed on the outcome variable, so Output1 is omitted.

pca_set <- treasury_data %>%
  select(USGG3M, USGG6M, USGG2YR, USGG3YR, USGG5YR, USGG10YR, USGG30YR)

c(pca_rows = dim(pca_set)[1], pca_columns = dim(pca_set)[2]) %>%
  custom_kable()
pca_rows 8300
pca_columns 7

Taking a quick look at the new data frame, the PCA set can be seen below.

head(pca_set) %>%
  custom_kable()
USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
13.52 13.09 12.289 12.28 12.294 12.152 11.672
13.58 13.16 12.429 12.31 12.214 12.112 11.672
14.50 13.90 12.929 12.78 12.614 12.382 11.892
14.76 14.00 13.099 12.95 12.684 12.352 11.912
15.20 14.30 13.539 13.28 12.884 12.572 12.132
15.22 14.23 13.179 12.94 12.714 12.452 12.082

To reaffirm the extreme correlation between predictors, I put together a pairwise plot below. As seen, the predictors are highly, and almost perfectly, correlated.

pca_set %>%
  select(USGG3M, USGG2YR, USGG5YR) %>%
  ggpairs(title = "Pairwise plot for USGG3M, USGG2YR, and USGG5YR",
          lower = list(continuous = regression_pairs))

Code for a 3D representation of this plot can be found below (although it will not produce an output in the R markdown file).

pca_set %>%
  select(USGG3M, USGG2YR, USGG5YR) %>%
  rgl.points()

Manually create a covariance matrix. Use a contour plot to visualize results

The covariance matrix is an essential component of PCA. The technique relies on the eigenvalues of this matrix, which represent the amount of variance each principal component explains (generally scaled to a percentage). At the same time, PCA factor scores are derived by multiplying the centred original data set (i.e. each column has its mean subtracted) with the resulting eigenvectors. Additionally, the eigenvector matrix comprises the PCA loadings. Given this, a necessary starting point is developing this covariance object.

While R has a built-in function to get this matrix, cov, I’ve developed my own function to illuminate how the process works. The covariance matrix is found by centring each variable in the data set, which in this case means taking the average yield rate by bond category and subtracting it from each rate value. From there, the new centred matrix is multiplied against itself. For this to work, the first matrix term has to be transposed (7 x 8300 * 8300 x 7). The resulting square, symmetric matrix (7 x 7) is the unnormalized covariance matrix. To perform the normalization, each matrix element is divided by the number of rows in the PCA set minus one (accounting for degrees of freedom). Following this, the covariance matrix is fully constructed. The function I developed, manual_covariance, follows these steps as seen in the chunk below.

manual_covariance <- function(dataset){
  #centre each column by subtracting its column mean
  centred_matrix <- as.matrix(scale(dataset, scale = F))
  #cross-product of the centred matrix with itself: t(X) %*% X
  covariance_matrix <- t(centred_matrix) %*% centred_matrix
  #normalize by n - 1 to account for degrees of freedom
  covariance_matrix <- covariance_matrix / (nrow(dataset) - 1)
  custom_kable(covariance_matrix)
}

manual_covariance(pca_set)
USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
USGG3M 11.760393 11.855287 12.303031 11.942035 11.188856 9.924865 8.587987
USGG6M 11.855287 12.000510 12.512434 12.158422 11.406959 10.128890 8.768702
USGG2YR 12.303031 12.512434 13.284203 12.977542 12.279514 11.005377 9.600181
USGG3YR 11.942035 12.158422 12.977542 12.708647 12.068078 10.856033 9.497246
USGG5YR 11.188856 11.406959 12.279514 12.068078 11.543082 10.463386 9.212159
USGG10YR 9.924865 10.128890 11.005377 10.856033 10.463386 9.583483 8.510632
USGG30YR 8.587987 8.768702 9.600181 9.497246 9.212159 8.510632 7.624304

To confirm that the new function returns the correct matrix, I’ve checked it against the cov function. As seen, the results are the same.

covariance_matrix <- as.data.frame(cov(pca_set))

covariance_matrix  %>%
  custom_kable()
USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
USGG3M 11.760393 11.855287 12.303031 11.942035 11.188856 9.924865 8.587987
USGG6M 11.855287 12.000510 12.512434 12.158422 11.406959 10.128890 8.768702
USGG2YR 12.303031 12.512434 13.284203 12.977542 12.279514 11.005377 9.600181
USGG3YR 11.942035 12.158422 12.977542 12.708647 12.068078 10.856033 9.497246
USGG5YR 11.188856 11.406959 12.279514 12.068078 11.543082 10.463386 9.212159
USGG10YR 9.924865 10.128890 11.005377 10.856033 10.463386 9.583483 8.510632
USGG30YR 8.587987 8.768702 9.600181 9.497246 9.212159 8.510632 7.624304

The covariance matrix can be visualized using a slightly modified contour plot. Normally, a contour plot requires a numeric x and y-axis. To relax this constraint, each bond category can be turned into a number, with USGG3M and USGG6M becoming decimals (.25 and .5 for a quarter and half year). From there, the plot can be constructed.

I’ve added dots to signal where each bond class value meets another in the covariance matrix. For example, the USGG30YR and USGG30YR point in the matrix is 7.624, which is labelled in the plot’s far right corner. A contour plot normally has a third variable to interpret (z) but, since both the x- and y-axes hold the same values, there isn’t actually any density to review. That said, it’s still an effective way to visualize the covariance matrix.

contour_plot <- covariance_matrix %>%
  mutate(bond = c(.25, .5, 2, 3, 5, 10, 30))

contour_plot <- melt(contour_plot, id.vars = "bond")

contour_plot %>%
  mutate(value = round(value, 3),
         bond = as.numeric(bond),
         y = rep(c(.25, .5, 2, 3, 5, 10, 30), each = 7),
         label_marker = ifelse(value < 8 | value == 9.583 | value == 11.543, T, F )) %>%
  ggplot(aes(bond, y, z = value, label = ifelse(label_marker, value, NA))) +
  geom_contour(size = 1.3) +
  geom_point(size = 2, colour = "darkorange") +
  geom_label() +
  scale_x_continuous(breaks = seq(0, 30, 5)) +
  scale_y_continuous(breaks = seq(0, 30, 5)) +
    labs(title = "Contour plot for bond categories",
       subtitle = "Each orange dot represents the covariance intersect of a bond category",
       x = "bond category",
       y = "bond category",
       caption = "Source: Course Project Treasury Data")

Perform the PCA by manually calculating factors, loadings. Find eigenvalues and eigenvectors. Calculate the vector of means (zero loading) and the first 3 loadings and factors.

I’ve already described this process, but I’ll briefly reiterate it. To derive the principal component factors, loadings, and factor scores, a series of matrix multiplication operations take place. This starts with a centred matrix, which is the PCA set with each column mean subtracted (Y naught). This is used to find a covariance matrix between all the bond category yield rates. Thereafter, the covariance matrix is used to find eigenvalues and eigenvectors. The decomposition is possible here because the covariance matrix has full rank and is symmetric. I’ve also included the factor matrix (F), which is the centred matrix multiplied by the eigenvector matrix, producing the factor scores.

The initial calculations include all seven factors, which I’ll reduce to factors one, two, and three given they are the analysis focus.

centred_matrix <- as.matrix(scale(pca_set, scale = F))

means_vector <- colMeans(pca_set)

eigen_values <- eigen(covariance_matrix)$values

eigen_vectors <- eigen(covariance_matrix)$vectors

pca_factors <-  data.frame(
  Date = treasury_data$Date,
  centred_matrix %*% eigen_vectors) %>%
  select(Date, X1, X2, X3) %>%
  rename(F1 = X1,
         F2 = X2,
         F3 = X3)
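Before comparing against princomp, the identities the manual derivation relies on can be sanity-checked in isolation. A sketch on a small simulated matrix (the names are illustrative, not from the project set):

```r
set.seed(42)

#small stand-in for the centred yield matrix
X <- scale(matrix(rnorm(800), 200, 4) %*% matrix(runif(16), 4, 4), scale = FALSE)
S <- t(X) %*% X / (nrow(X) - 1)   #covariance matrix, as in manual_covariance
e <- eigen(S)

#decomposition identity: S = V diag(lambda) t(V)
stopifnot(all.equal(S, e$vectors %*% diag(e$values) %*% t(e$vectors)))

#factor scores are uncorrelated and their variances equal the eigenvalues
F_scores <- X %*% e$vectors
stopifnot(all.equal(unname(diag(cov(F_scores))), e$values))
```

Both checks passing confirms the covariance eigendecomposition behaves as the derivation above assumes.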

As always with a manual process, it’s good to check that everything went smoothly and the correct values were derived. To set up a comparison, I’ll use the princomp function, which automates PCA. I’ll check the eigenvector matrix, which contains the loadings, against the one I developed. A comparison of the charts shows they are the same, which validates the process. I’ll provide a more robust comparison in a later section, but this is a good starting point since it validates the values I derived manually. With this in mind, I’ll use my variables going forward.

bond_pca <- princomp(pca_set)

bond_pca$loadings[,1:7] %>% 
  custom_kable()
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
USGG3M -0.3839609 0.5074451 0.5298222 -0.4037350 0.3860878 -0.0397629 0.0267425
USGG6M -0.3901870 0.4394614 0.1114737 0.4052645 -0.6787624 0.0947545 -0.0909135
USGG2YR -0.4151851 0.1111272 -0.4187873 0.4089695 0.3787209 -0.2984864 0.4900099
USGG3YR -0.4063541 -0.0169699 -0.4476561 -0.0643375 0.2362448 0.1976003 -0.7315706
USGG5YR -0.3860610 -0.2314032 -0.2462364 -0.5335766 -0.2868460 0.4212577 0.4385596
USGG10YR -0.3477544 -0.4324598 0.1500903 -0.1985654 -0.2562426 -0.7356186 -0.1526275
USGG30YR -0.3047124 -0.5442123 0.4979195 0.4209884 0.2074508 0.3777669 0.0091998
rownames(eigen_vectors) <- bond_categories

colnames(eigen_vectors) <- c("Comp.1", "Comp.2", "Comp.3", "Comp.4", "Comp.5",
                             "Comp.6", "Comp.7")

eigen_vectors %>%
  custom_kable()
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
USGG3M -0.3839609 -0.5074451 0.5298222 0.4037350 0.3860878 -0.0397629 0.0267425
USGG6M -0.3901870 -0.4394614 0.1114737 -0.4052645 -0.6787624 0.0947545 -0.0909135
USGG2YR -0.4151851 -0.1111272 -0.4187873 -0.4089695 0.3787209 -0.2984864 0.4900099
USGG3YR -0.4063541 0.0169699 -0.4476561 0.0643375 0.2362448 0.1976003 -0.7315706
USGG5YR -0.3860610 0.2314032 -0.2462364 0.5335766 -0.2868460 0.4212577 0.4385596
USGG10YR -0.3477544 0.4324598 0.1500903 0.1985654 -0.2562426 -0.7356186 -0.1526275
USGG30YR -0.3047124 0.5442123 0.4979195 -0.4209884 0.2074508 0.3777669 0.0091998

Analyze the importance for PCA factors and loadings

With the PCA process confirmed, I’ll work towards conducting the factor and loadings review. Here, the real power of conducting a PCA analysis will become clear. To begin, there are seven factors to review, which are the eigenvalues from the previous step.

Eigenvalues here are a proxy for how much variance the analysis captures. These seven values distill the variance of the original data set; they come ordered so the largest eigenvalue is first, with decreasing importance thereafter. Each factor’s share of explained variance is found by dividing its eigenvalue by the sum of all eigenvalues.

The variance explanations can then be charted, as seen below. Factor one captures about 98% of the information from all the bond categories. This is fairly astounding but, given how highly correlated the predictors are, not totally unexpected. The power of PCA can be seen firsthand here: in essence, an entire data set was compressed into seven factors, the first of which explains about 98% of the total variance, or information, in the data. That means the remaining 2% is spread between the other six factors and, even then, mostly in factor two. As a whole, factors one and two more or less describe all seven predictors. The remaining five contain very little information but do contribute some marginal variance explanation.

as.data.frame(eigen_values) %>%
  mutate(factors = c("F1", "F2", "F3", "F4", "F5", "F6", "F7"),
         variance_explained = round(eigen_values / sum(eigen_values), 2),
         factors = reorder(factors, variance_explained)) %>%
  ggplot(aes(factors, variance_explained, 
             label = paste0(variance_explained * 100,"%"))) +
  geom_col(fill = "royalblue2") +
  coord_flip() +
  geom_label(hjust = .3, nudge_x = 0.05) +
  labs(title = "PCA variance plot- Factor 1 provides most variance explanation (98%)",
       caption = "Source: Course Project Treasury Data")

Interpret the factors by looking at the shapes of the loadings

Next are the loadings. As mentioned, the focus here is on factors one through three, so I’ve created a data frame with just those for analysis. The faceted plot depicts the original loadings alongside a transformed version where F1 has been multiplied by -1 (which switches the values’ signs).

Starting with F1 (red line), it’s clear that the factor is evenly represented across all bond categories, save for some minor fluctuations. F2 (green line) displays heavy negative weightings on some of the short-term bonds, such as USGG3M and USGG6M, but rises steadily for long-term bonds. F3 (blue line) forms a “U” shape, with higher loadings in short and long-term bonds and decreasing values in the middle-term categories. F2 and F3 are fairly erratic and cross zero, whereas F1 is much more stable.

That more or less just describes their movement and shape, though. Focusing on a wider interpretation, there is further meaning to be drawn. Starting at USGG3M, F3 has the highest loading value, with F1 and F2 much further down. As such, F3 can be thought of as capturing the most information from that bond class. Granted, this is probably noisy information, given the factor only accounts for a small part of the data set’s variance, but it at least adds an intuitive dimension to the analysis. F1 captures most categories evenly, which again makes sense given it accounts for 98% of the data’s variance.

There’s still a lesson here, though: the lower variance factors do capture information on specific bond classes that F1 might not fully account for. F2 seems to have higher loadings than F1 and F3 for USGG3YR; while this might be noise, there could be some valuable outlier information here. In any case, thinking about the nuance of each factor loading helps get the most out of the PCA analysis.

As a final macro point, all the loadings can be used to describe the day-to-day movement of the bond yields. Since the first factor captures the most information, its movement can be used to describe most yield movement. Since the loadings don’t change, they become a marker for rate change and can be applied on any day, in combination with that day’s factor scores and the variable means, to recover the daily yields.
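That reconstruction can be made concrete. Below is a hedged sketch on simulated data (with the project variables, means_vector, eigen_vectors, and a row of factor scores would play the same roles):

```r
set.seed(7)

#simulated stand-in: 100 days of 5 correlated "yields"
Y  <- matrix(rnorm(500), 100, 5) %*% matrix(runif(25), 5, 5)
mu <- colMeans(Y)
V  <- eigen(cov(Y))$vectors

#any day's yields = column means + that day's factor scores times t(loadings)
scores_day1   <- (Y[1, ] - mu) %*% V
reconstructed <- mu + as.vector(scores_day1 %*% t(V))

stopifnot(all.equal(reconstructed, Y[1, ]))
```

Because the eigenvector matrix is orthonormal, using all factors recovers the day’s yields exactly; truncating to the first few factors gives the approximation explored later for USGG10YR.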

pca_loadings <- data.frame(
  eigen(covariance_matrix)$vectors) %>%
  select(X1, X2, X3) %>%
  rename(F1 = X1,
         F2 = X2,
         F3 = X3)

pca_loadings <- melt(pca_loadings,
                     variable.name = "factor",
                     value.name = "pca_loadings")

pca_loadings <- pca_loadings %>%
  mutate(pca_loadings = ifelse(factor == "F1", pca_loadings * -1, pca_loadings)) %>%
  bind_rows(pca_loadings) %>%
  mutate(loading_class = ifelse(seq(pca_loadings) == 1:21, "F1 * -1", "original"),
         loading_class = factor(loading_class, levels = c("original", "F1 * -1"))) %>%
  select(factor, loading_class, pca_loadings)

pca_loadings %>%
  mutate(bond_category = rep(factor(bond_categories, levels = bond_categories), 6)) %>%
  ggplot(aes(bond_category, pca_loadings, colour = factor, group = factor)) +
  geom_line(size = 1.5) +
  facet_wrap(facets = "loading_class", nrow = 2) +
  scale_y_continuous(breaks = seq(-1, 1, .3)) +
    labs(title = "PCA loadings by factor across all bond categories",
         subtitle = "F1 even across all bonds, F2 shows negative loadings for short-term classes, F3 is low for medium-term bonds (original interpretations)",
       caption = "Source: Course Project Treasury Data")

When assessing the factor scores, a familiar line is evident. In the top facet, factor one has been multiplied by -1, changing the score signs. Following this transformation, the line bears a strong resemblance to Output1. I’ll come back to this in more depth shortly, but the output mystery is finally uncovered here.

In terms of interpretation, since factor one captures almost all of the bond category movement history, its movement can be read as a history of yield rate changes since 1981. In the early 1980s, all bond classes had very high yields relative to the rest of the samples; during this time, the F1 scores are at their highest. All of the previous analyses can be seen here too, such as the easing and tightening periods. Additionally, the big financial events, such as the 2008 crisis, are evident as well. To offer a colloquial summation: as F1 goes, so too go the Treasury yield rates.

While F2 and F3 are fairly stable in comparison, they capture different bond information and can be thought of as sequential histories as well. Additionally, factor two captures variation that might be useful and not covered by F1. For example, around sample 2800, F2 rises slightly at a time when F1 is decreasing. Places where the lines are incongruent might offer insights that F1 omits but another factor captures. Each factor also has a connection to the rates, though not nearly as strong or defined as F1’s.

factor_scores <- melt(pca_factors[,-1],
                     variable.name = "factor",
                     value.name = "factor_scores")

factor_scores <- factor_scores %>%
  mutate(factor_scores = ifelse(factor == "F1", factor_scores * -1, factor_scores)) %>%
  bind_rows(factor_scores)

factor_scores %>%
  mutate(loading_class = rep(c("F1 * -1", "original"), each = 24900),
         loading_class = factor(loading_class, levels = c("F1 * -1", "original")),
         index = rep(seq(pca_factors$F1), 6)) %>%
  ggplot(aes(index, factor_scores, colour = factor, group = factor)) +
  scale_x_continuous(breaks = seq(0, 8500, 1000)) +
  geom_line(size = 1.5) +
  facet_wrap(facets = "loading_class", nrow = 2, scales = "free_y") +
  labs(title = "PCA factor scores across all samples",
       caption = "Source: Course Project Treasury Data")

Plot Factor 1 against Factor 2. Draw at least three conclusions from the plot.

To start, the shape of the plot reveals the orthogonal nature of each factor. By definition, they are uncorrelated and so share no overlapping information; the correlation between F1 and F2 is zero (1). The shape also suggests that F1 leads F2, given the direction of the plot and the interaction of the points. This makes more sense from a time series perspective, which the underlying plot also contains (2). To make this clearer, I’ve added a colour for each decade in the sample so the history of the two factors can be better reviewed.

The spacing of points across each decade is revealing as well. For example, the spacing is more distant and sweeping in the 1980s, a period noted for large day-to-day swings in yield rates. In contrast, the factor scores in the 2010s are spaced much closer together; reviewing the yield rates during that time, they appear very stable without any major shocks (3). Given this, the bivariate F1-F2 plot, much like the faceted factor score visualization, provides insight into the history of rate changes since 1981 (4). To highlight this, and to assist with audience comprehension, I’ve included date markers to help interpret the plot. Since it moves from right to left, which is not how time series are traditionally viewed, the markers help guide the viewer.

pca_factors <- pca_factors %>%
  mutate(F1 = F1 * -1,
         decade.marker = substr(as.character(Date), 1, 3),
         decade = case_when(
           decade.marker == "198" ~ "1980s",
           decade.marker == "199" ~ "1990s",
           decade.marker == "200" ~ "2000s",
           decade.marker == "201" ~ "2010s")) %>%
  select(Date, decade, F1, F2, F3)

pca_factors %>%
  mutate(date_indicate =  case_when(
           Date == "1981-01-05" ~ "Start here: 1981",
           Date == "2014-06-26" ~ "End here: 2014")) %>%
  ggplot(aes(F1, F2, colour = decade, label = date_indicate)) +
  geom_point(size = 3, alpha = .3) +
  geom_label(hjust = .3, nudge_x = -5.5, nudge_y = -.5, fontface = "bold") +
  scale_y_continuous(breaks = seq(-5, 5, 1)) +
      labs(title = "PCA factor 1 vs 2 over time from 1981 to 2014 (right to left)",
           subtitle = "Factor spacings indicate different yield volatilty across decades, factors are orthogonal (R = 0), F1 leading F2",
       caption = "Source: Course Project Treasury Data")

Analyze the adjustments that each factor makes to the term curve. Explain how loading shapes affect the adjustments using only factor 1, factors 1 and 2, and all 3 factors.

The factor adjustments shape the yield term curves in different ways. I’ve included a faceted plot comparing term curves under individual and combined factor influence, which makes it easier to see how each factor shapes the curve. Overall, F1 provides a parallel shift of the line, F2 changes the slope, and F3 affects the middle curvature of the term curve.

pca_loadings <- data.frame(
  eigen(covariance_matrix)$vectors) %>%
  select(X1, X2, X3) %>%
  mutate(X1 = X1 * -1)

old_curve <- treasury_data[135,2:8]

new_curve <- treasury_data[136,2:8]

factor_change <- pca_factors[136,3:5] - pca_factors[135,3:5]

curve_change <- old_curve - new_curve

F1_adj <- old_curve + (t(pca_loadings[,1]) * as.numeric(factor_change[1]))

F2_adj <- old_curve + (t(pca_loadings[,1]) * as.numeric(factor_change[1])) +
  (t(pca_loadings[,2]) *  as.numeric(factor_change[2]))

F3_adj <-  old_curve + (t(pca_loadings[,1]) * as.numeric(factor_change[1])) +
  (t(pca_loadings[,2]) *  as.numeric(factor_change[2])) +
  (t(pca_loadings[,3]) *  as.numeric(factor_change[3]))

F2 <- old_curve + (t(pca_loadings[,2]) *  as.numeric(factor_change[2]))

F3 <-  old_curve + (t(pca_loadings[,3]) *  as.numeric(factor_change[3]))
 
curve_adjustment <- bind_rows(old_curve, new_curve, F1_adj, F2_adj, F3_adj,
                              old_curve, new_curve, F1_adj, F2, F3)

curve_adjustment <- melt(curve_adjustment, 
             variable.name = "bond_category", 
             value.name = "curve_adjustment")

curves <- c("Old Curve", "New Curve", "1-Factor Adj.", 
                       "2-Factor Adj.", "3-Factor Adj.")

adjustment <- c("All factors", "Individual factors")

curve_adjustment <- curve_adjustment %>%
  mutate(adjustment = factor(rep(adjustment, each = 5, 7), levels = adjustment),
         curve = factor(rep(curves, 14), levels = curves))

curve_adjustment %>%
  ggplot(aes(bond_category, curve_adjustment, colour = curve, group = curve)) +
  geom_line(size = 1.5, alpha = .4) +
  facet_wrap(facets = "adjustment", nrow = 2) +
  scale_y_continuous(breaks = seq(10, 18, .5)) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  labs(title = "Yields rates by category with adjusted term curves for individual and combined factors",
       subtitle = "F1 provides parallel shift, F2 provides change of slope, F3 changes middle curvature",
       caption = "Source: Course Project Treasury Data")

Explore the goodness of fit for the 10Y yield

The USGG10YR approximations are very close to the actual values. These are derived by using the mean value for USGG10YR (taken from the vector of means) in combination with the bond category’s first three loadings, each multiplied by its corresponding factor scores.

ten_year <- data.frame(
  pca_approx = means_vector[6] +
  pca_loadings[6,1] * 
  pca_factors[,3] + 
  pca_loadings[6,2] *
  pca_factors[,4] + 
  pca_loadings[6,3] * 
  pca_factors[,5],
  USGG10YR = treasury_data$USGG10YR
)

ten_year %>%
  slice(1:10) %>%
  custom_kable()
pca_approx USGG10YR
11.99677 12.152
11.97187 12.112
12.25286 12.382
12.27920 12.352
12.51041 12.572
12.37916 12.452
12.41123 12.532
12.37217 12.532
12.53218 12.622
12.41604 12.532

Two metrics provide insight into the goodness of fit. Mean absolute error (MAE) measures how large the average miss between the approximation and the actual values is; it indicates the average miss was only about .038. Mean absolute percentage error (MAPE) is similar but expresses the miss as a percentage of the actual value. Here, the approximations were, on average, about .757% off from the actuals. Again, this points to a very close fit between the approximations and actuals.

ten_year %>%
  summarise(MAE = round(mean(abs(pca_approx - USGG10YR)), 3),
            MAPE = round(mean(MAE / USGG10YR * 100), 3)) %>%
  custom_kable()
MAE MAPE
0.038 0.757
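One caveat worth noting: the chunk above divides the overall (already rounded) MAE by each actual value, which approximates MAPE. Textbook MAPE instead averages each row’s own percentage error. A toy sketch of the distinction, on made-up numbers rather than the project data:

```r
actual <- c(12.1, 12.4, 12.5)
approx <- c(12.0, 12.45, 12.6)

mae  <- mean(abs(approx - actual))
#textbook MAPE: average of per-row absolute percentage errors
mape <- mean(abs(approx - actual) / actual) * 100

round(c(MAE = mae, MAPE = mape), 3) #MAE 0.083, MAPE 0.677
```

With yields all in a narrow range, as here, the two formulations give very similar answers, so the conclusion of a close fit stands either way.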

The plot of both lines is, unsurprisingly, indicative of a very close fit as well. In fact, both lines look nearly identical, which is expected given the average miss is so low.

ten_year <- melt(ten_year, 
             variable.name = "value_composition",
             value.name = "USGG10YR")

ten_year %>%
  mutate(index = rep(seq(treasury_data$USGG10YR), 2)) %>%
  ggplot(aes(index, USGG10YR, colour = value_composition)) +
  geom_line(aes(size = value_composition), alpha = .65) +
  scale_size_manual(values = c(2, 1)) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  scale_colour_manual(values = c("darkorange", "royalblue2")) +
  labs(title = "Actual USGG10YR yield vs PCA approximated numbers- values are very closely aligned",
       caption = "Source: Course Project Treasury Data")

Repeat the PCA using princomp function

I began this process already simply because I wanted to vet my manual values before using them in this section of the analysis. However, I only provided a cursory check, so I’ll do a more thorough review here. I’ve already developed the PCA object for the predictors, so in the first chunk I’m just listing the names associated with it.

c(pca_names = names(bond_pca)) %>%
  custom_kable()
pca_names1 sdev
pca_names2 loadings
pca_names3 center
pca_names4 scale
pca_names5 n.obs
pca_names6 scores
pca_names7 call

The eigenvector matrix and loadings were compared in my first check but, to reiterate here using the first three factors, the values are the same, save for a sign difference in factor two.

data.frame(
  bond_pca$loadings[,1:3],
  eigen_vectors[,1:3]) %>%
  custom_kable()
Comp.1 Comp.2 Comp.3 Comp.1.1 Comp.2.1 Comp.3.1
USGG3M -0.3839609 0.5074451 0.5298222 -0.3839609 -0.5074451 0.5298222
USGG6M -0.3901870 0.4394614 0.1114737 -0.3901870 -0.4394614 0.1114737
USGG2YR -0.4151851 0.1111272 -0.4187873 -0.4151851 -0.1111272 -0.4187873
USGG3YR -0.4063541 -0.0169699 -0.4476561 -0.4063541 0.0169699 -0.4476561
USGG5YR -0.3860610 -0.2314032 -0.2462364 -0.3860610 0.2314032 -0.2462364
USGG10YR -0.3477544 -0.4324598 0.1500903 -0.3477544 0.4324598 0.1500903
USGG30YR -0.3047124 -0.5442123 0.4979195 -0.3047124 0.5442123 0.4979195

In the same vein, the loadings plot is the same as well.

pca_loadings <- melt(bond_pca$loadings[1:7,1:3],
                     variable.name = "factor",
                     value.name = "pca_loadings")

pca_loadings <- pca_loadings %>%
  mutate(pca_loadings = ifelse(Var2 == "Comp.1", pca_loadings * -1, pca_loadings)) %>%
  bind_rows(pca_loadings) %>%
  mutate(loading_class = ifelse(seq(pca_loadings) == 1:21, "F1 * -1", "princomp"),
         loading_class = factor(loading_class, levels = c("princomp", "F1 * -1"))) %>%
  rename(factor = Var2,
         bond_category = Var1) %>%
  select(bond_category, factor, loading_class, pca_loadings)

pca_loadings %>%
  ggplot(aes(bond_category, pca_loadings, colour = factor, group = factor)) +
  geom_line(size = 1.5) +
  facet_wrap(facets = "loading_class", nrow = 2) +
  scale_y_continuous(breaks = seq(-1, 1, .3)) +
    labs(title = "PCA loadings by factor across all bond categories using princomp",
       caption = "Source: Course Project Treasury Data")

This also holds true for the factor scores plot.

factor_scores <- melt(bond_pca$scores[,1:3],
                     variable.name = "factor",
                     value.name = "factor_scores")

factor_scores <- factor_scores %>%
  mutate(factor_scores = ifelse(Var2 == "Comp.1", factor_scores * -1, factor_scores)) %>%
  bind_rows(factor_scores) %>%
  select(-Var1)

factor_scores %>%
  rename(factor = Var2) %>%
  mutate(loading_class = rep(c("F1 * -1", "princomp"), each = 24900),
         loading_class = factor(loading_class, levels = c("F1 * -1", "princomp")),
         index = rep(seq(pca_factors$F1), 6)) %>%
  ggplot(aes(index, factor_scores, colour = factor, group = factor)) +
  scale_x_continuous(breaks = seq(0, 8500, 1000)) +
  geom_line(size = 1.5) +
  facet_wrap(facets = "loading_class", nrow = 2, scales = "free_y") +
  labs(title = "PCA factor scores across all samples from princomp",
       caption = "Source: Course Project Treasury Data")

What is Output1?

Put simply, Output1 is the factor scores from F1 as derived from the PCA analysis. This means it is essentially a meta-feature compiled from all the original bond category data: a linear combination of the predictors. Output1 captures about 98% of the information from all the yield rates, making it a very good indicator of how all the bond categories move in one condensed variable. Given all the yields are so closely correlated, PCA is a very reasonable approach here. From a practical point of view, this would be an excellent addition to any finance-based analysis that called for using Treasury yields, because it captures so much rate information while also providing dimensionality reduction. That said, the models developed throughout the analysis aren’t really useful for wider predictions, since they gauge the predictors’ relationship to themselves.

output_uncovered <- data.frame(
  value_composition = rep(c("Output1 from data", "Manual Factor 1", "Princomp F1"),
                          each = 8300),
  values = as.vector(c(treasury_data$Output1, pca_factors[,3], bond_pca$scores[,1] * -1))
)

output_uncovered  %>%
  mutate(index = rep(seq(treasury_data$Output1), 3)) %>%
  ggplot(aes(index, values, colour = value_composition)) +
  geom_line(aes(size = value_composition), alpha = .65) +
  scale_size_manual(values = c(3, 2, 1)) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  scale_colour_manual(values = c("gray", "darkorange", "royalblue2")) +
  labs(title = "Output1, manual factor 1 scores, and princomp factor 1 scores - plot shows perfect fit between all three groups",
       subtitle = "Output1 is the factor scores from F1 as derived from the PCA analysis",
       caption = "Source: Course Project Treasury Data")

Compare the regression coefficients from Step 2 and Step 3 with factor loadings

This table clearly shows that the intercept and slope coefficients from the switched regressions (Output1 as the predictor, each bond category as the outcome variable) are equal to the PCA centres (variable means) and first-component loadings, respectively.

data.frame(
  t(apply(treasury_data[,2:8], 2, 
        function(x) summary(lm(x ~ Output1, treasury_data))$coefficients[1:2])),
  pca_centres = bond_pca$center, 
  pca_loadings = bond_pca$loadings[,1] * -1) %>%
  rename(regression_intercept = X1,
         regression_slope = X2) %>%
  custom_kable()
regression_intercept regression_slope pca_centres pca_loadings
USGG3M 4.675134 0.3839609 4.675134 0.3839609
USGG6M 4.844369 0.3901870 4.844369 0.3901870
USGG2YR 5.438888 0.4151851 5.438888 0.4151851
USGG3YR 5.644458 0.4063541 5.644458 0.4063541
USGG5YR 6.009421 0.3860610 6.009421 0.3860610
USGG10YR 6.481316 0.3477544 6.481316 0.3477544
USGG30YR 6.869355 0.3047124 6.869355 0.3047124
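Why this identity holds can be sketched on synthetic data (toy matrix below, not the treasury yields): each slope is cov(x, score) / var(score), and because the score is built from an eigenvector of the sample covariance matrix, that ratio collapses to the loading itself; since PC scores have mean zero, the intercept is just the column mean.

```r
# Toy check: regressing each column on the first PC score recovers
# the column mean (intercept) and the first loading (slope)
set.seed(2)
common <- cumsum(rnorm(200))
toy <- sapply(c(4, 5, 6), function(m) m + 0.4 * common + rnorm(200, sd = 0.05))

toy_pca <- princomp(toy)
score1 <- toy_pca$scores[, 1]

coefs <- t(apply(toy, 2, function(x) coef(lm(x ~ score1))))

all.equal(unname(coefs[, 1]), unname(toy_pca$center))          # intercepts = means
all.equal(unname(coefs[, 2]), unname(toy_pca$loadings[, 1]))   # slopes = loadings
```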

Is there a correspondence between the coefficients of models Output1~Yield and the first loading?

Using the centred matrix constructed earlier, this question can be examined. First, a quick check of its dimensions:

c(dim = dim(centred_matrix)) %>%
  custom_kable()
dim1 8300
dim2 7

The same process as before can be applied, which shows that each simple regression (Output1 ~ yield) has the same slope coefficient whether it is fit on the raw data or on the centred matrix: centring a predictor only shifts the intercept, never the slope.

centred_matrix <- as.data.frame(centred_matrix) %>%
  mutate(Output1 = treasury_data$Output1)

t(data.frame(
  original_lm = t(apply(treasury_data[,2:8], 2, 
        function(x) summary(lm(Output1 ~ x, treasury_data))$coefficients[2])),
  centred_lm = t(apply(centred_matrix, 2, 
        function(x) summary(lm(Output1 ~ x, centred_matrix))$coefficients[2])))) %>%
  custom_kable()
original_lm.USGG3M 2.507561
original_lm.USGG6M 2.497235
original_lm.USGG2YR 2.400449
original_lm.USGG3YR 2.455793
original_lm.USGG5YR 2.568742
original_lm.USGG10YR 2.786991
original_lm.USGG30YR 3.069561
centred_lm.USGG3M 2.507561
centred_lm.USGG6M 2.497235
centred_lm.USGG2YR 2.400449
centred_lm.USGG3YR 2.455793
centred_lm.USGG5YR 2.568742
centred_lm.USGG10YR 2.786991
centred_lm.USGG30YR 3.069561
centred_lm.Output1 1.000000
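This invariance is general: subtracting a constant from a predictor leaves cov(x, y) / var(x) unchanged, so only the intercept moves. A quick sketch on toy data (not the project's):

```r
# Centring a predictor leaves the slope untouched and only shifts the intercept
set.seed(3)
x <- rnorm(50, mean = 10)
y <- 2 + 3 * x + rnorm(50, sd = 0.1)

fit_raw     <- lm(y ~ x)
fit_centred <- lm(y ~ I(x - mean(x)))

all.equal(unname(coef(fit_raw)[2]), unname(coef(fit_centred)[2]))   # slopes match
```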

Building on this, the slope coefficients from the multiple regression of Output1 on all the centred yields are equal to the loadings of the first principal component.

data.frame(
  loadings = bond_pca$loadings[,1] * -1,
  from_model = t(lm(Output1 ~ ., data = centred_matrix)$coef)[-1]) %>%
  custom_kable()
loadings from_model
USGG3M 0.3839609 0.3839609
USGG6M 0.3901870 0.3901870
USGG2YR 0.4151851 0.4151851
USGG3YR 0.4063541 0.4063541
USGG5YR 0.3860610 0.3860610
USGG10YR 0.3477544 0.3477544
USGG30YR 0.3047124 0.3047124
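This exact recovery is expected: because Output1 is an exact linear combination of the centred columns, the multiple regression has (numerically) zero residuals, and its coefficient vector must be the loading vector itself. A minimal sketch on synthetic data (not the project data):

```r
# The first PC score is centred_data %*% first_loading, so regressing it on
# the centred columns recovers the loadings with essentially zero residuals
set.seed(4)
toy <- matrix(rnorm(300), ncol = 3)
toy_pca <- princomp(toy)
toy_centred <- scale(toy, center = toy_pca$center, scale = FALSE)

fit <- lm(toy_pca$scores[, 1] ~ toy_centred)

all.equal(unname(coef(fit)[-1]), unname(toy_pca$loadings[, 1]))
max(abs(resid(fit)))   # numerically zero
```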

References: